Cloudera Manager Health Checks

Transcription

1 Cloudera, Inc. 220 Portage Avenue Palo Alto, CA US: Intl: Cloudera Manager Health Checks

2 Important Notice Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and any other product or service names or slogans contained in this document, except as otherwise disclaimed, are trademarks of Cloudera and its suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior written permission of Cloudera or the applicable trademark holder. Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other trademarks, registered trademarks, product names and company names or logos mentioned in this document are the property of their respective owners. Reference to any products, services, processes or other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute or imply endorsement, sponsorship or recommendation thereof by us. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Cloudera. Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Cloudera, the furnishing of this document does not give you any license to these patents, trademarks copyrights, or other intellectual property. The information in this document is subject to change without notice. Cloudera shall not be liable for any damages resulting from technical errors or omissions which may be present in this document, or from use of this document. Version: 4.6 Date: August 8, 2013

3 Contents ACTIVITY MONITOR ACTIVITY MONITOR PIPELINE... 1 ACTIVITY MONITOR ACTIVITY TREE PIPELINE... 1 ACTIVITY MONITOR FILE DESCRIPTOR... 2 ACTIVITY MONITOR HOST HEALTH... 3 ACTIVITY MONITOR LOG DIRECTORY FREE SPACE... 3 ACTIVITY MONITOR CLOUDERA MANAGER AGENT HEALTH... 4 ACTIVITY MONITOR UNEXPECTED EXITS... 5 ACTIVITY MONITOR WEB METRIC COLLECTION... 5 FLUME AGENT FILE DESCRIPTOR... 6 FLUME AGENT HOST HEALTH... 6 FLUME AGENT LOG DIRECTORY FREE SPACE... 7 FLUME AGENT CLOUDERA MANAGER AGENT HEALTH... 8 FLUME AGENT UNEXPECTED EXITS... 8 ALERT PUBLISHER FILE DESCRIPTOR... 9 ALERT PUBLISHER HOST HEALTH... 9 ALERT PUBLISHER LOG DIRECTORY FREE SPACE ALERT PUBLISHER CLOUDERA MANAGER AGENT HEALTH ALERT PUBLISHER UNEXPECTED EXITS DATANODE BLOCK COUNT DATANODE CONNECTIVITY DATANODE FILE DESCRIPTOR DATANODE FREE SPACE REMAINING DATANODE GARBAGE COLLECTION DURATION DATANODE HIGH AVAILABILITY CONNECTIVITY DATANODE HOST HEALTH DATANODE LOG DIRECTORY FREE SPACE DATANODE CLOUDERA MANAGER AGENT HEALTH DATANODE UNEXPECTED EXITS DATANODE VOLUME FAILURES DATANODE WEB METRIC COLLECTION EVENT SERVER EVENT STORE SIZE... 19

4 EVENT SERVER FILE DESCRIPTOR EVENT SERVER HOST HEALTH EVENT SERVER INDEX DIRECTORY FREE SPACE EVENT SERVER LOG DIRECTORY FREE SPACE EVENT SERVER CLOUDERA MANAGER AGENT HEALTH EVENT SERVER UNEXPECTED EXITS EVENT SERVER WEB METRIC COLLECTION EVENT SERVER WRITE PIPELINE FAILOVER CONTROLLER FILE DESCRIPTOR FAILOVER CONTROLLER HOST HEALTH FAILOVER CONTROLLER LOG DIRECTORY FREE SPACE FAILOVER CONTROLLER CLOUDERA MANAGER AGENT HEALTH FAILOVER CONTROLLER UNEXPECTED EXITS FLUME AGENTS HEALTH HBASE BACKUP MASTERS HEALTH HBASE MASTER HEALTH HBASE REGIONSERVERS HEALTH HBASE REST SERVER FILE DESCRIPTOR HBASE REST SERVER HOST HEALTH HBASE REST SERVER LOG DIRECTORY FREE SPACE HBASE REST SERVER CLOUDERA MANAGER AGENT HEALTH HBASE REST SERVER UNEXPECTED EXITS HBASE THRIFT SERVER FILE DESCRIPTOR HBASE THRIFT SERVER HOST HEALTH HBASE THRIFT SERVER LOG DIRECTORY FREE SPACE HBASE THRIFT SERVER CLOUDERA MANAGER AGENT HEALTH HBASE THRIFT SERVER UNEXPECTED EXITS HDFS BLOCKS WITH CORRUPT REPLICAS HDFS CANARY HEALTH HDFS CORRUPT REPLICAS HDFS DATANODES HEALTH HDFS FREE SPACE REMAINING... 39

5 HDFS HIGH AVAILABILITY NAMENODE HEALTH HDFS MISSING BLOCKS HDFS NAMENODE HEALTH HDFS STANDBY NAMENODES HEALTH HDFS UNDER REPLICATED BLOCKS HOST AGENT LOG DIRECTORY FREE SPACE HOST AGENT PARCEL DIRECTORY FREE SPACE HOST AGENT PROCESS DIRECTORY FREE SPACE HOST CLOCK OFFSET HOST DNS RESOLUTION HOST DNS RESOLUTION DURATION HOST MEMORY SWAPPING HOST NETWORK FRAME ERRORS HOST NETWORK INTERFACES SLOW MODE HOST CLOUDERA MANAGER AGENT HEALTH HOST MONITOR FILE DESCRIPTOR HOST MONITOR HOST HEALTH HOST MONITOR HOST PIPELINE HOST MONITOR LOG DIRECTORY FREE SPACE HOST MONITOR CLOUDERA MANAGER AGENT HEALTH HOST MONITOR UNEXPECTED EXITS HOST MONITOR WEB METRIC COLLECTION HTTPFS FILE DESCRIPTOR HTTPFS HOST HEALTH HTTPFS LOG DIRECTORY FREE SPACE HTTPFS CLOUDERA MANAGER AGENT HEALTH HTTPFS UNEXPECTED EXITS IMPALA ASSIGNMENT LOCALITY IMPALA DAEMONS HEALTH IMPALA STATESTORE HEALTH IMPALAD CONNECTIVITY IMPALAD FILE DESCRIPTOR... 59

6 IMPALAD HOST HEALTH IMPALAD LOG DIRECTORY FREE SPACE IMPALAD MEMORY RESIDENT SET SIZE HEALTH IMPALAD CLOUDERA MANAGER AGENT HEALTH IMPALAD UNEXPECTED EXITS IMPALAD WEB METRIC COLLECTION JOBTRACKER FILE DESCRIPTOR JOBTRACKER GARBAGE COLLECTION DURATION JOBTRACKER HOST HEALTH JOBTRACKER LOG DIRECTORY FREE SPACE JOBTRACKER CLOUDERA MANAGER AGENT HEALTH JOBTRACKER UNEXPECTED EXITS JOBTRACKER WEB METRIC COLLECTION JOURNALNODE EDITS DIRECTORY FREE SPACE JOURNALNODE FILE DESCRIPTOR JOURNALNODE GARBAGE COLLECTION DURATION JOURNALNODE HOST HEALTH JOURNALNODE LOG DIRECTORY FREE SPACE JOURNALNODE CLOUDERA MANAGER AGENT HEALTH JOURNALNODE SYNC STATUS JOURNALNODE UNEXPECTED EXITS JOURNALNODE WEB METRIC COLLECTION MAPREDUCE HIGH AVAILABILITY JOBTRACKER HEALTH MAPREDUCE JOB FAILURE RATIO MAPREDUCE JOBTRACKER HEALTH MAPREDUCE MAPS LOCALITY MAPREDUCE MAP BACKLOG MAPREDUCE REDUCE BACKLOG MAPREDUCE STANDBY JOBTRACKERS HEALTH MAPREDUCE TASKTRACKERS HEALTH MASTER CANARY HEALTH MASTER FILE DESCRIPTOR... 78

7 MASTER GARBAGE COLLECTION DURATION MASTER HOST HEALTH MASTER LOG DIRECTORY FREE SPACE MASTER CLOUDERA MANAGER AGENT HEALTH MASTER UNEXPECTED EXITS MASTER WEB METRIC COLLECTION MANAGEMENT ACTIVITY MONITOR HEALTH MANAGEMENT ALERT PUBLISHER HEALTH MANAGEMENT EVENT SERVER HEALTH MANAGEMENT HOST MONITOR HEALTH MANAGEMENT NAVIGATOR HEALTH MANAGEMENT REPORTS MANAGER HEALTH MANAGEMENT SERVICE MONITOR HEALTH NAMENODE CHECKPOINT AGE NAMENODE DATA DIRECTORIES FREE SPACE NAMENODE DIRECTORY FAILURES NAMENODE FILE DESCRIPTOR NAMENODE GARBAGE COLLECTION DURATION NAMENODE HIGH AVAILABILITY CHECKPOINT AGE NAMENODE HOST HEALTH NAMENODE JOURNALNODE SYNC STATUS NAMENODE LOG DIRECTORY FREE SPACE NAMENODE RPC LATENCY NAMENODE SAFE MODE NAMENODE CLOUDERA MANAGER AGENT HEALTH NAMENODE UNEXPECTED EXITS NAMENODE UPGRADE STATUS NAMENODE WEB METRIC COLLECTION NAVIGATOR FILE DESCRIPTOR NAVIGATOR HOST HEALTH NAVIGATOR LOG DIRECTORY FREE SPACE NAVIGATOR CLOUDERA MANAGER AGENT HEALTH... 96

8 NAVIGATOR UNEXPECTED EXITS REGIONSERVER COMPACTION QUEUE REGIONSERVER FILE DESCRIPTOR REGIONSERVER FLUSH QUEUE REGIONSERVER GARBAGE COLLECTION DURATION REGIONSERVER HOST HEALTH REGIONSERVER LOG DIRECTORY FREE SPACE REGIONSERVER MASTER CONNECTIVITY REGIONSERVER MEMSTORE SIZE REGIONSERVER READ LATENCY REGIONSERVER CLOUDERA MANAGER AGENT HEALTH REGIONSERVER STORE FILE IDX SIZE REGIONSERVER SYNC LATENCY REGIONSERVER UNEXPECTED EXITS REGIONSERVER WEB METRIC COLLECTION REPORTS MANAGER FILE DESCRIPTOR REPORTS MANAGER HOST HEALTH REPORTS MANAGER LOG DIRECTORY FREE SPACE REPORTS MANAGER CLOUDERA MANAGER AGENT HEALTH REPORTS MANAGER SCRATCH DIRECTORY FREE SPACE REPORTS MANAGER UNEXPECTED EXITS SECONDARY NAMENODE CHECKPOINT DIRECTORIES FREE SPACE SECONDARY NAMENODE FILE DESCRIPTOR SECONDARY NAMENODE GARBAGE COLLECTION DURATION SECONDARY NAMENODE HOST HEALTH SECONDARY NAMENODE LOG DIRECTORY FREE SPACE SECONDARY NAMENODE CLOUDERA MANAGER AGENT HEALTH SECONDARY NAMENODE UNEXPECTED EXITS SECONDARY NAMENODE WEB METRIC COLLECTION ZOOKEEPER SERVER CONNECTION COUNT ZOOKEEPER SERVER DATA DIRECTORY FREE SPACE ZOOKEEPER SERVER DATA LOG DIRECTORY FREE SPACE

9 ZOOKEEPER SERVER FILE DESCRIPTOR ZOOKEEPER SERVER GARBAGE COLLECTION DURATION ZOOKEEPER SERVER HOST HEALTH ZOOKEEPER SERVER LOG DIRECTORY FREE SPACE ZOOKEEPER SERVER MAX LATENCY ZOOKEEPER SERVER OUTSTANDING REQUESTS ZOOKEEPER SERVER QUORUM MEMBERSHIP ZOOKEEPER SERVER CLOUDERA MANAGER AGENT HEALTH ZOOKEEPER SERVER UNEXPECTED EXITS SERVICE MONITOR FILE DESCRIPTOR SERVICE MONITOR HOST HEALTH SERVICE MONITOR LOG DIRECTORY FREE SPACE SERVICE MONITOR ROLE PIPELINE SERVICE MONITOR CLOUDERA MANAGER AGENT HEALTH SERVICE MONITOR UNEXPECTED EXITS SERVICE MONITOR WEB METRIC COLLECTION STATESTORE FILE DESCRIPTOR STATESTORE HOST HEALTH STATESTORE LOG DIRECTORY FREE SPACE STATESTORE MEMORY RESIDENT SET SIZE HEALTH STATESTORE CLOUDERA MANAGER AGENT HEALTH STATESTORE UNEXPECTED EXITS STATESTORE WEB METRIC COLLECTION TASKTRACKER BLACKLISTED TASKTRACKER CONNECTIVITY TASKTRACKER FILE DESCRIPTOR TASKTRACKER GARBAGE COLLECTION DURATION TASKTRACKER HOST HEALTH TASKTRACKER LOG DIRECTORY FREE SPACE TASKTRACKER CLOUDERA MANAGER AGENT HEALTH TASKTRACKER UNEXPECTED EXITS TASKTRACKER WEB METRIC COLLECTION

10 ZOOKEEPER CANARY HEALTH ZOOKEEPER CURRENT ZXID ZOOKEEPER SERVERS HEALTH

11 Activity Monitor Activity Monitor Pipeline Activity Monitor Activity Monitor Pipeline Details: This Activity Monitor health check checks that no messages are being dropped by the activity monitor stage of the Activity Monitor pipeline. A failure of this health check indicates a problem with the Activity Monitor. This may indicate a configuration problem or a bug in the Activity Monitor. This test can be configured using the Activity Monitor Activity Monitor Pipeline Monitoring Time Period monitoring setting. Short Name: Activity Monitor Pipeline Activity Monitor Activity Monitor Pipeline Monitoring The health check for monitoring the Activity Monitor activity monitor pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period. activitymonitor_activity_ monitor_pipeline_ critical:any, warning:never Activity Monitor Activity Monitor Pipeline Monitoring Time Period The time period over which the Activity Monitor activity monitor pipeline will be monitored for dropped messages. activitymonitor_activity_ monitor_pipeline_ window 5 MINUTES Activity Monitor Activity Tree Pipeline Details: This Activity Monitor health check checks that no messages are being dropped by the activity tree stage of the Activity Monitor pipeline. A failure of this health check indicates a problem with the Activity Monitor. This may indicate a configuration problem or a bug in the Activity Monitor. This test can be configured using the Activity Monitor Activity Tree Pipeline Monitoring Time Period monitoring setting. Short Name: Activity Tree Pipeline Cloudera Manager 4.6 Health Checks 1

12 Activity Monitor File Descriptor Activity Monitor Activity Tree Pipeline Monitoring The health check activitymonitor_activity_ critical:any, for tree_pipeline_ warning:never monitoring the Activity Monitor activity tree pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period. Activity Monitor Activity Tree Pipeline Monitoring Time Period The time period over which the Activity Monitor activity tree pipeline will be monitored for dropped messages. activitymonitor_activity_ tree_pipeline_window 5 MINUTES Activity Monitor File Descriptor Details: This Activity Monitor health check checks that the number of file descriptors used does not rise above some percentage of the Activity Monitor file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Activity Monitor monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. activitymonitor_fd_ critical: , warning: Cloudera Manager 4.6 Health Checks

13 Activity Monitor Host Health Activity Monitor Host Health Details: This Activity Monitor health check factors in the health of the host upon which the Activity Monitor is running. A failure of this check means that the host running the Activity Monitor is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Activity Monitor Host Health Check Activity Monitor monitoring setting. Short Name: Host Health Activity Monitor Host Health Check When computing the overall Activity Monitor health, consider the host's health. activitymonitor_host_ health_enabled Activity Monitor Log Directory Free Space Details: This Activity Monitor health check checks that the filesystem containing the log directory of this Activity Monitor has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Activity Monitor monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , BYTES warning: Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a log_directory_free_ space_percentage_ critical:never, warning:never Cloudera Manager 4.6 Health Checks 3

14 Activity Monitor Cloudera Manager Agent Health percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. Activity Monitor Cloudera Manager Agent Health Details: This Activity Monitor health check checks that the Cloudera Manager Agent on the Activity Monitor host is heart beating correctly and that the process associated with the Activity Monitor role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Activity Monitor process, a lack of connectivity to the Cloudera Manager Agent on the Activity Monitor host, or a problem with the Cloudera Manager Agent. This check can fail either because the Activity Monitor has crashed or because the Activity Monitor will not start or stop in a timely fashion. Check the Activity Monitor logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Activity Monitor host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Activity Monitor host, or look in the Cloudera Manager Agent logs on the Activity Monitor host for more details. This test can be enabled or disabled using the Activity Monitor Process Health Check Activity Monitor monitoring setting. Short Name: Process Status Activity Monitor Process Health Check Enables the health check that the Activity Monitor's process state is consistent with the role configuration activitymonitor_scm_ health_enabled 4 Cloudera Manager 4.6 Health Checks

15 Activity Monitor Unexpected Exits Activity Monitor Unexpected Exits Details: This Activity Monitor health check checks that the Activity Monitor has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Activity Monitor monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check unexpected_exits_ for unexpected exits encountered within a recent period specified by the unexpected_exits_wind ow configuration for the role. critical:any, warning:never Activity Monitor Web Metric Collection Details: This Activity Monitor health check checks that the web server of the Activity Monitor is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the Activity Monitor, a misconfiguration of the Activity Monitor or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the Activity Monitor for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the Activity Monitor's web server are failing or timing out. These requests are completely local to the Activity Monitor's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the Activity Monitor's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Activity Monitor monitoring setting. Short Name: Web Server Status Cloudera Manager 4.6 Health Checks 5

16 Flume Agent File Descriptor Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. activitymonitor_web_ metric_collection_ enabled Flume Agent File Descriptor Details: This Agent health check checks that the number of file descriptors used does not rise above some percentage of the Agent file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Agent monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. flume_agent_fd_ critical: , warning: Flume Agent Host Health Details: This Agent health check factors in the health of the host upon which the Agent is running. A failure of this check means that the host running the Agent is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Flume Agent Host Health Check Agent monitoring setting. Short Name: Host Health Flume Agent Host When computing the overall Flume flume_agent_host_health 6 Cloudera Manager 4.6 Health Checks

17 Flume Agent Log Directory Free Space Health Check Agent health, consider the host's health. _enabled Flume Agent Log Directory Free Space Details: This Agent health check checks that the filesystem containing the log directory of this Agent has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Agent monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free _space_absolute_ critical: BYTES 00000, warning: Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_ free_space_ percentage_ critical:never, warning:never Cloudera Manager 4.6 Health Checks 7

18 Flume Agent Cloudera Manager Agent Health Flume Agent Cloudera Manager Agent Health Details: This Agent health check checks that the Cloudera Manager Agent on the Agent host is heart beating correctly and that the process associated with the Agent role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Agent process, a lack of connectivity to the Cloudera Manager Agent on the Agent host, or a problem with the Cloudera Manager Agent. This check can fail either because the Agent has crashed or because the Agent will not start or stop in a timely fashion. Check the Agent logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Agent host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Agent host, or look in the Cloudera Manager Agent logs on the Agent host for more details. This test can be enabled or disabled using the Flume Agent Process Health Check Agent monitoring setting. Short Name: Process Status Flume Agent Process Health Check Enables the health check that the Flume Agent's process state is consistent with the role configuration flume_agent_scm_health _enabled Flume Agent Unexpected Exits Details: This Agent health check checks that the Agent has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Agent monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected_exits_ critical:any, warning:never 8 Cloudera Manager 4.6 Health Checks

19 Alert Publisher File Descriptor unexpected exits encountered within a recent period specified by the unexpected_exits_windo w configuration for the role. Alert Publisher File Descriptor Details: This Alert Publisher health check checks that the number of file descriptors used does not rise above some percentage of the Alert Publisher file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Alert Publisher monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. alertpublisher_fd_ critical: , warning: Alert Publisher Host Health Details: This Alert Publisher health check factors in the health of the host upon which the Alert Publisher is running. A failure of this check means that the host running the Alert Publisher is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Alert Publisher Host Health Check Alert Publisher monitoring setting. Short Name: Host Health Alert Publisher Host Health Check When computing the overall Alert alertpublisher_host_health _enabled Cloudera Manager 4.6 Health Checks 9

20 Alert Publisher Log Directory Free Space Publisher health, consider the host's health. Alert Publisher Log Directory Free Space Details: This Alert Publisher health check checks that the filesystem containing the log directory of this Alert Publisher has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Alert Publisher monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never 10 Cloudera Manager 4.6 Health Checks

21 Alert Publisher Cloudera Manager Agent Health Alert Publisher Cloudera Manager Agent Health Details: This Alert Publisher health check checks that the Cloudera Manager Agent on the Alert Publisher host is heart beating correctly and that the process associated with the Alert Publisher role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Alert Publisher process, a lack of connectivity to the Cloudera Manager Agent on the Alert Publisher host, or a problem with the Cloudera Manager Agent. This check can fail either because the Alert Publisher has crashed or because the Alert Publisher will not start or stop in a timely fashion. Check the Alert Publisher logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Alert Publisher host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Alert Publisher host, or look in the Cloudera Manager Agent logs on the Alert Publisher host for more details. This test can be enabled or disabled using the Alert Publisher Process Health Check Alert Publisher monitoring setting. Short Name: Process Status Alert Publisher Process Health Check Enables the health alertpublisher_scm_ check that the Alert health_enabled Publisher's process state is consistent with the role configuration Alert Publisher Unexpected Exits Details: This Alert Publisher health check checks that the Alert Publisher has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Alert Publisher monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected_exits_ critical:any, warning:never Cloudera Manager 4.6 Health Checks 11

22 DataNode Block Count unexpected exits encountered within a recent period specified by the unexpected_exits_wind ow configuration for the role. DataNode Block Count Details: This is a DataNode health check that checks for whether the DataNode has too many blocks. A failure of this health check indicates that there may be performance problems with the DataNode. See the DataNode system for more information. This test can be enabled or disabled using the DataNode Block Count DataNode monitoring setting. Short Name: Block Count DataNode Block Count The health check of the number of blocks on a DataNode datanode_block_ count_ critical:never, warning: DataNode Connectivity Details: This is a DataNode health check that checks that the NameNode considers the DataNode alive. A failure of this health check may indicate that the DataNode is having trouble communicating with the NameNode. Look in the DataNode logs for more details. This test can be enabled or disabled using the DataNode Connectivity Health Check DataNode monitoring setting. The DataNode Connectivity Tolerance at Startup DataNode monitoring setting and the Health Check Startup Tolerance NameNode monitoring setting can be used to control the check's tolerance windows around DataNode and NameNode restarts respectively. Short Name: NameNode Connectivity DataNode Connectivity Health Enables the health check that verifies the datanode_connectivity _health_enabled 12 Cloudera Manager 4.6 Health Checks

23 DataNode File Descriptor Check DataNode Connectivity Tolerance at Startup DataNode is connected to the NameNode The amount of time to datanode_connectivity wait for the DataNode to _tolerance fully start up and connect to the NameNode before enforcing the connectivity check. 180 SECONDS Health Check Startup Tolerance The amount of time allowed after this role is started that failures of health checks that rely on communication with this role will be tolerated. namenode_startup_ tolerance 5 MINUTES DataNode File Descriptor Details: This DataNode health check checks that the number of file descriptors used does not rise above some percentage of the DataNode file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring DataNode monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. datanode_fd_ critical: , warning: Cloudera Manager 4.6 Health Checks 13

24 DataNode Free Space Remaining DataNode Free Space Remaining Details: This is a DataNode health check that checks that the amount of free space available for HDFS block data on the DataNode does not fall below some percentage of total configured capacity of the DataNode. A failure of this health check may indicate a capacity planning problem. Try adding more disk capacity and additional data directories to the DataNode, or add additional DataNodes and take steps to rebalance your HDFS cluster. This test can be configured using the DataNode Free Space Monitoring DataNode monitoring setting. Short Name: Free Space DataNode Free Space Monitoring The health check of free space in a DataNode. Specified as a percentage of the capacity on the DataNode. datanode_free_space_ critical: , warning: DataNode Garbage Collection Duration Details: This DataNode health check checks that the DataNode is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or misconfiguration of the DataNode. This test can be configured using the DataNode Garbage Collection Duration and DataNode Garbage Collection Duration Monitoring Period DataNode monitoring settings. Short Name: GC Duration DataNode Garbage Collection Duration Monitoring Period The period to review when computing the moving average of garbage collection time. datanode_gc_duration _window 5 MINUTES DataNode Garbage Collection Duration The health check for the weighted average datanode_gc_duration _ critical: , warning: Cloudera Manager 4.6 Health Checks

25 DataNode High Availability Connectivity time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See DataNode Garbage Collection Duration Monitoring Period. DataNode High Availability Connectivity Details: This is a DataNode health check that checks that the all running NameNodes in the HDFS service consider the DataNode alive. A failure of this health check may indicate that the DataNode is having trouble communicating with some or all NameNodes in the service. Look in the DataNode logs for more details. This test can be enabled or disabled using the DataNode Connectivity Health Check DataNode monitoring setting. The DataNode Connectivity Tolerance at Startup DataNode monitoring setting and the Health Check Startup Tolerance NameNode monitoring setting can be used to control the check's tolerance windows around DataNode and NameNode restarts respectively. Short Name: NameNode Connectivity DataNode Connectivity Health Check Enables the health check that verifies the DataNode is connected to the NameNode datanode_connectivity_ health_enabled DataNode Connectivity Tolerance at Startup The amount of time to wait for the DataNode to fully start up and connect to the NameNode before enforcing the connectivity check. datanode_connectivity_ tolerance 180 SECONDS Health Check The amount of time allowed after this role namenode_startup_ 5 MINUTES Cloudera Manager 4.6 Health Checks 15

26 DataNode Host Health Startup Tolerance is started that failures of health checks that rely on communication with this role will be tolerated. tolerance DataNode Host Health Details: This DataNode health check factors in the health of the host upon which the DataNode is running. A failure of this check means that the host running the DataNode is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the DataNode Host Health Check DataNode monitoring setting. Short Name: Host Health DataNode Host Health Check When computing the overall DataNode health, consider the host's health. datanode_host_health_ enabled DataNode Log Directory Free Space Details: This DataNode health check checks that the filesystem containing the log directory of this DataNode has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage DataNode monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES 16 Cloudera Manager 4.6 Health Checks

27 DataNode Cloudera Manager Agent Health Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never DataNode Cloudera Manager Agent Health Details: This DataNode health check checks that the Cloudera Manager Agent on the DataNode host is heart beating correctly and that the process associated with the DataNode role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the DataNode process, a lack of connectivity to the Cloudera Manager Agent on the DataNode host, or a problem with the Cloudera Manager Agent. This check can fail either because the DataNode has crashed or because the DataNode will not start or stop in a timely fashion. Check the DataNode logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the DataNode host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the DataNode host, or look in the Cloudera Manager Agent logs on the DataNode host for more details. This test can be enabled or disabled using the DataNode Process Health Check DataNode monitoring setting. Short Name: Process Status DataNode Process Health Check Enables the health datanode_scm_health check that the _enabled DataNode's process state is consistent with the role configuration Cloudera Manager 4.6 Health Checks 17

28 DataNode Unexpected Exits DataNode Unexpected Exits Details: This DataNode health check checks that the DataNode has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period DataNode monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check unexpected_exits_ for unexpected exits encountered within a recent period specified by the unexpected_exits_wind ow configuration for the role. critical:any, warning:never DataNode Volume Failures Details: This is a DataNode health check that checks for whether the DataNode has reported any failed volumes. A failure of this health check indicates that there is a problem with one or more volumes on the DataNode. See the DataNode system for more information. This test can be configured using the DataNode Volume Failures DataNode monitoring setting. Short Name: Data Directory Status DataNode Volume Failures The health check of failed volumes in a DataNode. datanode_volume_ failures_ critical:any, warning:never 18 Cloudera Manager 4.6 Health Checks

29 DataNode Web Metric Collection DataNode Web Metric Collection Details: This DataNode health check checks that the web server of the DataNode is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the DataNode, a misconfiguration of the DataNode or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the DataNode for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the DataNode's web server are failing or timing out. These requests are completely local to the DataNode's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the DataNode's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection DataNode monitoring setting. Short Name: Web Server Status Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. datanode_web_ metric_collection_ enabled Event Server Event Store Size Details: This is an Event Server health check that checks that the event store size has not grown too far above the configured event store capacity. A failure of this health check indicates that the Event Server is having a problem performing cleanup. This may indicate a configuration problem or bug in the Event Server. This test can be configured using the Event Store Capacity Monitoring Event Server monitoring setting. Short Name: Event Store Size Event Store Capacity Monitoring The health check on the number of events in the event store. Specified as a percentage of the eventserver_ capacity_ critical: , warning: Cloudera Manager 4.6 Health Checks 19

30 Event Server File Descriptor maximum number of events in Event Server store. Event Server File Descriptor Details: This Event Server health check checks that the number of file descriptors used does not rise above some percentage of the Event Server file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Event Server monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. eventserver_fd_ critical: , warning: Event Server Host Health Details: This Event Server health check factors in the health of the host upon which the Event Server is running. A failure of this check means that the host running the Event Server is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Event Server Host Health Check Event Server monitoring setting. Short Name: Host Health Event Server Host Health Check When computing the overall Event Server health, consider the host's health. eventserver_host_ health_enabled 20 Cloudera Manager 4.6 Health Checks

31 Event Server Index Directory Free Space Event Server Index Directory Free Space Details: This is an Event Server health check that checks that the filesystem containing the index directory of this Event Server has sufficient free space. This test can be configured using the Index Directory Free Space Monitoring Absolute and Index Directory Free Space Monitoring Percentage Event Server monitoring settings. Short Name: Index Directory Free Space Index Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains the index directory. eventserver_index_ directory_free_space _absolute_ critical: , warning: BYTES Index Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains the index directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if an Index Directory Free Space Monitoring Absolute setting is configured. eventserver_index_ critical:never, directory_free_space warning:never _percentage_ Event Server Log Directory Free Space Details: This Event Server health check checks that the filesystem containing the log directory of this Event Server has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Event Server monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the log_directory_free_ space_absolute_ critical: , warning: BYTES Cloudera Manager 4.6 Health Checks 21

32 Event Server Cloudera Manager Agent Health filesystem that contains this role's log directory Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never Event Server Cloudera Manager Agent Health Details: This Event Server health check checks that the Cloudera Manager Agent on the Event Server host is heart beating correctly and that the process associated with the Event Server role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Event Server process, a lack of connectivity to the Cloudera Manager Agent on the Event Server host, or a problem with the Cloudera Manager Agent. This check can fail either because the Event Server has crashed or because the Event Server will not start or stop in a timely fashion. Check the Event Server logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Event Server host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Event Server host, or look in the Cloudera Manager Agent logs on the Event Server host for more details. This test can be enabled or disabled using the Event Server Process Health Check Event Server monitoring setting. Short Name: Process Status Event Server Process Health Check Enables the health check eventserver_scm_ that the Event Server's health_enabled process state is consistent with the role configuration 22 Cloudera Manager 4.6 Health Checks

33 Event Server Unexpected Exits Event Server Unexpected Exits Details: This Event Server health check checks that the Event Server has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Event Server monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check unexpected_exits_ for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. critical:any, warning:never Event Server Web Metric Collection Details: This Event Server health check checks that the web server of the Event Server is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the Event Server, a misconfiguration of the Event Server or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the Event Server for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the Event Server's web server are failing or timing out. These requests are completely local to the Event Server's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the Event Server's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Event Server monitoring setting. Short Name: Web Server Status Cloudera Manager 4.6 Health Checks 23

34 Event Server Write Pipeline Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. eventserver_web_ metric_collection_ enabled Event Server Write Pipeline Details: This Event Server health check checks that no messages are being dropped by the writer stage of the Event Server pipeline. A failure of this health check indicates a problem with the Event Server. This may indicate a configuration problem or a bug in the Event Server. This test can be configured using the Event Server Write Pipeline Monitoring Time Period monitoring setting. Short Name: Write Pipeline Event Server Write Pipeline Monitoring The health check for monitoring the Event Server write pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period. eventserver_write_ pipeline_ critical:any, warning:never Event Server Write Pipeline Monitoring Time Period The time period over which the Event Server write pipeline will be monitored for dropped messages. eventserver_write_ pipeline_window 5 MINUTES Failover Controller File Descriptor Details: This Failover Controller health check checks that the number of file descriptors used does not rise above some percentage of the Failover Controller file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be 24 Cloudera Manager 4.6 Health Checks

35 Failover Controller Host Health configured using the File Descriptor Monitoring Failover Controller monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. failovercontroller_fd _ critical: , warning: Failover Controller Host Health Details: This Failover Controller health check factors in the health of the host upon which the Failover Controller is running. A failure of this check means that the host running the Failover Controller is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the FailoverController Host Health Check Failover Controller monitoring setting. Short Name: Host Health FailoverController Host Health Check When computing the overall FailoverController health, consider the host's health. failovercontroller_ host_health_enabled Failover Controller Log Directory Free Space Details: This Failover Controller health check checks that the filesystem containing the log directory of this Failover Controller has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Failover Controller monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log log_directory_free_ space_absolute_ critical: , warning: BYTES Cloudera Manager 4.6 Health Checks 25

36 Failover Controller Cloudera Manager Agent Health directory. Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never Failover Controller Cloudera Manager Agent Health Details: This Failover Controller health check checks that the Cloudera Manager Agent on the Failover Controller host is heart beating correctly and that the process associated with the Failover Controller role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Failover Controller process, a lack of connectivity to the Cloudera Manager Agent on the Failover Controller host, or a problem with the Cloudera Manager Agent. This check can fail either because the Failover Controller has crashed or because the Failover Controller will not start or stop in a timely fashion. Check the Failover Controller logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Failover Controller host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Failover Controller host, or look in the Cloudera Manager Agent logs on the Failover Controller host for more details. This test can be enabled or disabled using the FailoverController Process Health Check Failover Controller monitoring setting. Short Name: Process Status FailoverController Process Health Check Enables the health check that the FailoverController's process state is consistent with the role configuration failovercontroller_ scm_health_enabled 26 Cloudera Manager 4.6 Health Checks

37 Failover Controller Unexpected Exits Failover Controller Unexpected Exits Details: This Failover Controller health check checks that the Failover Controller has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Failover Controller monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check unexpected_exits_ for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. critical:any, warning:never Flume Agents Health Details: This is a Flume service-level health check that checks that enough of the Flume agents in the cluster are healthy. The check returns "Concerning" health if the number of healthy Flume agents falls below a warning threshold, expressed as a percentage of the total number of Flume agents. The check returns "Bad" health if the number of healthy and "Concerning" Flume agents falls below a critical threshold, expressed as a percentage of the total number of Flume agents. For example, if this check is configured with a warning threshold of 80% and a critical threshold of 60% for a cluster of five Flume agents, this check would return "Good" health if four or more Flume agents have good health. This check would return "Concerning" health if at least three Flume agents have either "Good" or "Concerning" health. If more than two Flume agents have bad health, this check would return "Bad" health. A failure of this health check indicates unhealthy Flume agents. Check the status of the individual Flume agents for more information. This test can be configured using the Healthy Flume Agent Monitoring Flume service-wide monitoring setting. Short Name: Flume Agents Health Cloudera Manager 4.6 Health Checks 27

38 HBase Backup Masters Health Healthy Flume Agent Monitoring The health check flume_agents_ of the overall Flume Agents healthy_ health. The check returns "Concerning" health if the percentage of "Healthy" Flume Agents falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" Flume Agents falls below the critical threshold. critical:never, warning: HBase Backup Masters Health Details: This is an HBase service-level health check that checks for running, healthy HBase Masters in backup mode. The check is disabled if the HBase service is not configured with multiple HBase Masters. Otherwise, the check returns "Concerning" health if either of two conditions are met. First, if there is no HBase Master running in backup mode. Second, if any of the HBase Masters running in backup mode are in less than "Good" health. This second condition is included because a failure of the active HBase Master leads to a race condition between all backup HBase Masters. When there is a less than healthy backup HBase Master, it is possible that it could become the active HBase Master if it won such a race, and the HBase service could end up with a less than healthy active HBase Master. A failure of this health check may indicate one or more stopped or unhealthy backup HBase Masters, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and the HBase service. Check the status of the HBase service's Master roles and the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Backup Masters Health Check HBase service-wide monitoring setting. In addition, the HBase Active Master Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active HBase Master before this health check fails. Short Name: Backup HBase Master Health Backup Masters Health Check When computing the overall HBase cluster health, consider the health of the backup HBase hbase_backup_masters _health_enabled 28 Cloudera Manager 4.6 Health Checks

39 HBase Master Health Masters. HBase Active Master Detection Window The tolerance window that will be used in HBase service tests that depend on detection of the active HBase Master. hbase_active_master_ detecton_window 3 MINUTES HBase Master Health Details: This is an HBase service-level health check that checks for the presence of an active, running and healthy HBase Master. The check returns "Bad" health if the service is running and a running, active Master cannot be found. In all other cases it returns the health of the running, active Master. A failure of this health check may indicate stopped or unhealthy Master roles, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and the HBase service. Check the status of the HBase service's Master roles and look in the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Active Master Health Check HBase service-wide monitoring setting. In addition, the HBase Active Master Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active HBase Master before this health check fails. Short Name: Active HBase Master Health Active Master Health Check When computing the overall HBase cluster health, consider the active HBase Master's health. hbase_master_health _enabled HBase Active Master Detection Window The tolerance window that will be used in HBase service tests that depend on detection of the active HBase Master. hbase_active_master _detecton_window 3 MINUTES Cloudera Manager 4.6 Health Checks 29

40 HBase RegionServers Health HBase RegionServers Health Details: This is an HBase service-level health check that checks that enough of the RegionServers in the cluster are healthy. The check returns "Concerning" health if the number of healthy RegionServers falls below a warning threshold, expressed as a percentage of the total number of RegionServers. The check returns "Bad" health if the number of healthy and "Concerning" RegionServers falls below a critical threshold, expressed as a percentage of the total number of RegionServers. For example, if this check is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 RegionServers, this check would return "Good" health if 95 or more RegionServers have good health. This check would return "Concerning" health if at least 90 RegionServers have either "Good" or "Concerning" health. If more than 10 RegionServers have bad health, this check would return "Bad" health. A failure of this health check indicates unhealthy RegionServers. Check the status of the individual RegionServers for more information. This test can be configured using the Healthy HBase RegionServers Monitoring HBase service-wide monitoring setting. Short Name: RegionServers Health Healthy HBase Region Servers Monitoring The health check of the overall HBase RegionServers health. The check returns "Concerning" health if the percentage of "Healthy" HBase RegionServers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" HBase RegionServers falls below the critical threshold. hbase_regionservers _healthy_ critical: , warning: HBase REST Server File Descriptor Details: This HBase REST Server health check checks that the number of file descriptors used does not rise above some percentage of the HBase REST Server file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring HBase REST Server monitoring setting. Short Name: File Descriptors 30 Cloudera Manager 4.6 Health Checks

41 HBase REST Server Host Health File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. hbaserestserver_fd_ critical: , warning: HBase REST Server Host Health Details: This HBase REST Server health check factors in the health of the host upon which the HBase REST Server is running. A failure of this check means that the host running the HBase REST Server is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the HBase REST Server Host Health Check HBase REST Server monitoring setting. Short Name: Host Health HBase Rest Server Host Health Check When computing the hbaserestserver_host overall HBase REST Server _health_enabled health, consider the host's health. HBase REST Server Log Directory Free Space Details: This HBase REST Server health check checks that the filesystem containing the log directory of this HBase REST Server has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage HBase REST Server monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Cloudera Manager 4.6 Health Checks 31

42 HBase REST Server Cloudera Manager Agent Health Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_s pace_percentage_ critical:never, warning:never HBase REST Server Cloudera Manager Agent Health Details: This HBase REST Server health check checks that the Cloudera Manager Agent on the HBase REST Server host is heart beating correctly and that the process associated with the HBase REST Server role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the HBase REST Server process, a lack of connectivity to the Cloudera Manager Agent on the HBase REST Server host, or a problem with the Cloudera Manager Agent. This check can fail either because the HBase REST Server has crashed or because the HBase REST Server will not start or stop in a timely fashion. Check the HBase REST Server logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the HBase REST Server host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the HBase REST Server host, or look in the Cloudera Manager Agent logs on the HBase REST Server host for more details. This test can be enabled or disabled using the HBase REST Server Process Health Check HBase REST Server monitoring setting. Short Name: Process Status HBase Rest Server Process Health Check Enables the health check that the HBase REST Server's process state is consistent with the role configuration hbaserestserver_scm_ health_enabled 32 Cloudera Manager 4.6 Health Checks

43 HBase REST Server Unexpected Exits HBase REST Server Unexpected Exits Details: This HBase REST Server health check checks that the HBase REST Server has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period HBase REST Server monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check unexpected_exits_ for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. critical:any, warning:never HBase Thrift Server File Descriptor Details: This HBase Thrift Server health check checks that the number of file descriptors used does not rise above some percentage of the HBase Thrift Server file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring HBase Thrift Server monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. hbasethriftserver_ fd_ critical: , warning: Cloudera Manager 4.6 Health Checks 33

44 HBase Thrift Server Host Health HBase Thrift Server Host Health Details: This HBase Thrift Server health check factors in the health of the host upon which the HBase Thrift Server is running. A failure of this check means that the host running the HBase Thrift Server is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the HBase Thrift Server Host Health Check HBase Thrift Server monitoring setting. Short Name: Host Health HBase Thrift Server Host Health Check When computing the overall HBase Thrift Server health, consider the host's health. hbasethriftserver_host _health_enabled HBase Thrift Server Log Directory Free Space Details: This HBase Thrift Server health check checks that the filesystem containing the log directory of this HBase Thrift Server has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage HBase Thrift Server monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: BYTES 00000, warning: Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring log_directory_free_ space_percentage_ critical:never, warning:never 34 Cloudera Manager 4.6 Health Checks

45 HBase Thrift Server Cloudera Manager Agent Health Absolute setting is configured. HBase Thrift Server Cloudera Manager Agent Health Details: This HBase Thrift Server health check checks that the Cloudera Manager Agent on the HBase Thrift Server host is heart beating correctly and that the process associated with the HBase Thrift Server role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the HBase Thrift Server process, a lack of connectivity to the Cloudera Manager Agent on the HBase Thrift Server host, or a problem with the Cloudera Manager Agent. This check can fail either because the HBase Thrift Server has crashed or because the HBase Thrift Server will not start or stop in a timely fashion. Check the HBase Thrift Server logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the HBase Thrift Server host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the HBase Thrift Server host, or look in the Cloudera Manager Agent logs on the HBase Thrift Server host for more details. This test can be enabled or disabled using the HBase Thrift Server Process Health Check HBase Thrift Server monitoring setting. Short Name: Process Status HBase Thrift Server Process Health Check Enables the health check that the HBase Thrift Server's process state is consistent with the role configuration hbasethriftserver_ scm_health_enabled HBase Thrift Server Unexpected Exits Details: This HBase Thrift Server health check checks that the HBase Thrift Server has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period HBase Thrift Server monitoring settings. Short Name: Unexpected Exits Cloudera Manager 4.6 Health Checks 35

46 HDFS Blocks With Corrupt Replicas Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never HDFS Blocks With Corrupt Replicas Details: This is an HDFS service-level health check that checks that the number of corrupt blocks does not rise above some percentage of the cluster's total blocks. A block is called corrupt by HDFS if it has at least one corrupt replica along with at least one live replica. As such, a corrupt block does not indicate unavailable data, but they do indicate an increased chance that data may become unavailable. If none of a block's replicas are live, the block is called a missing block by HDFS, not a corrupt block. HDFS automatically fixes corrupt blocks in the background. A failure of this health check may indicate a problem with the underlying storage or filesystem of a DataNode. Use the HDFS fsck command to identify which files contain corrupt blocks. This test can be configured using the Blocks With Corrupt Replicas Monitoring HDFS service-wide monitoring setting. Short Name: Corrupt Blocks Blocks With Corrupt The health check Replicas Monitoring of the number of blocks that have at least one corrupt replica. Specified as a percentage of the total number of blocks. hdfs_blocks_with_ corrupt_replicas_ critical: , warning: Cloudera Manager 4.6 Health Checks

47 HDFS Canary Health HDFS Canary Health Details: This is an HDFS service-level health check that checks that basic client operations are working and are completing in a reasonable amount of time. This check reports the results of a periodic "canary" test that performs the following sequence of operations. First, it creates a file. By default, the path is /tmp/.cloudera_health_monitoring_canary_<i>timestamp</i>. The canary test then writes a small amount of data to that file, reads that data back, and verifies that the data is correct. Lastly, the canary test removes the created file. The check returns "Bad" health if any of the basic operations fail. The check returns "Concerning" health if the canary test runs too slowly. A failure of this health check may indicate that the cluster is failing to satisfy basic client requests correctly or in a timely fashion. Check the status of the NameNode and other service-level health checks. Look in the Service Monitor logs for log messages from the canary test. Look in the NameNode logs for more details about the processing of the canary test requests. This test can be enabled or disabled using the HDFS Canary Health Check HDFS service-wide monitoring setting. Short Name: HDFS Canary HDFS Canary Health Check Enables the health check that a client can create, read, write, and delete files hdfs_canary_health_ enabled HDFS Corrupt Replicas Details: This is an HDFS service-level health check that checks that the number of corrupt replicas does not rise above some percentage of the cluster's total blocks. A block in HDFS is usually made up of multiple replicas, so a corrupt replica does not by itself indicate unavailable data. Unavailable data is indicated by missing blocks. Corrupt replicas do indicate an increased chance that data may become unavailable. If none of a block's replicas are live, the block is called a missing block by HDFS. HDFS automatically fixes corrupt replicas in the background. A failure of this health check may indicate a problem with the underlying storage or filesystem of a DataNode. Use the HDFS fsck command to identify which files contain corrupt blocks. This test can be configured using the Corrupt Replicas Monitoring HDFS service-wide monitoring setting. Note that the percentage here that we are thresholding is a computation of replicas divided by blocks, so it could be more than 100% in some cases. Short Name: Corrupt Replicas Corrupt Replicas The health check hdfs_corrupt_blocks_ critical: , Cloudera Manager 4.6 Health Checks 37

48 HDFS DataNodes Health Monitoring of the number of corrupt replica. Specified as a percentage of the total number of blocks. Note that there are more replicas than blocks, so it is theoretically possible for this to be over one hundred percent. warning: HDFS DataNodes Health Details: This is an HDFS service-level health check that checks that enough of the DataNodes in the cluster are healthy. The check will return "Concerning" health if the number of healthy DataNodes falls below a warning threshold, expressed as a percentage of the total number of DataNodes. The check returns "Bad" health if the number of healthy and "Concerning" DataNodes falls below a critical threshold, expressed as a percentage of the total number of DataNodes. For example, if this check is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 DataNodes, this check would return "Good" health if 95 or more DataNodes have good health. This check would return "Concerning" health if at least 90 DataNodes have either "Good" or "Concerning" health. If more than 10 DataNodes have bad health, this check would return "Bad" health. A failure of this health check indicates unhealthy DataNodes. Check the status of the individual DataNodes for more information. This test can be configured using the Healthy DataNodes Monitoring HDFS service-wide monitoring setting. Short Name: DataNodes Health Healthy DataNodes Monitoring The health check hdfs_datanodes_ of the overall healthy_ DataNodes health. The check returns "Concerning" health if the percentage of "Healthy" DataNodes falls below the warning threshold. The check is unhealthy if the total percentage of critical: , warning: Cloudera Manager 4.6 Health Checks

49 HDFS Free Space Remaining "Healthy" and "Concerning" DataNodes falls below the critical threshold. HDFS Free Space Remaining Details: This is an HDFS service-level health check that checks that the amount of free space in the HDFS cluster does not fall below some percentage of total configured capacity. A failure of this health check may indicate a capacity planning problem, or a loss of DataNodes. This test can be configured using the HDFS Free Space Monitoring HDFS service-wide monitoring setting. Short Name: Free Space HDFS Free Space Monitoring The health check of free space in HDFS. Specified as a percentage of total HDFS capacity. hdfs_free_space_ critical: , warning: HDFS High Availability NameNode Health Details: This is an HDFS service-level health check that checks for the presence of an active, running and healthy NameNode. The check returns "Bad" health if the service is running and a running, active NameNode cannot be found. In all other cases it returns the health of the running, active NameNode. A failure of this health check may indicate stopped or unhealthy NameNode roles, the need to issue a failover command to make some NameNode active, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and one or more NameNodes. Check the status of the HDFS service's NameNode roles and look in the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Active NameNode Role Health Check HDFS service-wide monitoring setting. In addition, the Active NameNode Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active HDFS NameNode before this health check fails, and the NameNode Activation Startup Tolerance can be used to adjust the amount of time around NameNode startup that the check allows for a NameNode to be made active. Short Name: Active NameNode Health Cloudera Manager 4.6 Health Checks 39

50 HDFS Missing Blocks Active NameNode Detection Window The tolerance window hdfs_active_namenode_ that will be used in HDFS detecton_window service tests that depend on detection of the active NameNode. 3 MINUTES Active NameNode Role Health Check When computing the overall HDFS cluster health, consider the active NameNode's health hdfs_namenode_health_ enabled NameNode Activation Startup Tolerance The amount of time after NameNode(s) start that the lack of an active NameNode will be tolerated. This is intended to allow either the auto-failover daemon to make a NameNode active, or a specifically issued failover command to take effect. hdfs_namenode_ activation_startup_ tolerance 180 SECONDS HDFS Missing Blocks Details: This is an HDFS service-level health check that checks the number of missing blocks does not rise above some percentage of the cluster's total blocks. A missing block is a block with no live replicas. All replicas are either missing or corrupt. This may happen because of corruption or because DataNodes are offline or being decomissioned. A failure of this health check may indicate the loss of several DataNodes at once. If there are files stored in the cluster with a replication factor value of 1, you may see missing blocks with the loss or malfunction of a single DataNode. Use the HDFS fsck command to identify which files contain missing blocks. This test can be configured using the Missing Block Monitoring HDFS service-wide monitoring setting. Short Name: Missing Blocks 40 Cloudera Manager 4.6 Health Checks

51 HDFS NameNode Health Missing Block Monitoring The health check of the number of missing blocks. Specified as a percentage of the total number of blocks. hdfs_missing_blocks_ critical:any, warning:never HDFS NameNode Health Details: This HDFS service-level health check checks for the presence of a running, healthy NameNode. The check returns "Bad" health if the service is running and the NameNode is not running. In all other cases it returns the health of the NameNode. A failure of this health check indicates a stopped or unhealthy NameNode. Check the status of the NameNode for more information. This test can be enabled or disabled using the Active NameNode Role Health Check NameNode service-wide monitoring setting. Short Name: NameNode Health Active NameNode Role Health Check When computing the overall HDFS cluster health, consider the active NameNode's health hdfs_namenode_ health_enabled HDFS Standby NameNodes Health Details: This is an HDFS service-level health check that checks for a running, healthy NameNode in standby mode. The check is disabled if the HDFS service is not configured with multiple NameNodes. Otherwise, the check returns "Concerning" health if either of two conditions are met. First, if there is no NameNode running in standby mode. Second, if the running standby NameNode is in less than "Good" health. A failure of this health check may indicate one or more stopped or unhealthy NameNodes, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and some or all of the HDFS NameNodes. Check the status of the HDFS service's NameNode roles and the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Standby NameNode Health Check HDFS service-wide monitoring setting. In addition, the Active NameNode Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active NameNode before this health check fails. Short Name: Standby NameNode Health Cloudera Manager 4.6 Health Checks 41

52 HDFS Under Replicated Blocks Active NameNode Detection Window The tolerance window that will be used in HDFS service tests that depend on detection of the active NameNode. hdfs_active_namenode _detecton_window 3 MINUTES Standby NameNode Health Check When computing the overall HDFS cluster health, consider the health of the standby NameNode. hdfs_standby_ namenodes_health_ enabled HDFS Under Replicated Blocks Details: This is an HDFS service-level health check that checks that the number of under-replicated blocks does not rise above some percentage of the cluster's total blocks. A failure of this health check may indicate a loss of DataNodes. Use the HDFS fsck command to identify which files contain underreplicated blocks. This test can be configured using the Under-replicated Block Monitoring HDFS service-wide monitoring setting. Short Name: Under-Replicated Blocks Under-replicated Block Monitoring The health check of the number of underreplicated blocks. Specified as a percentage of the total number of blocks. hdfs_under_replicated_ blocks_ critical: , warning: Host Agent Log Directory Free Space Details: This is a host health check that checks that the filesystem containing the Cloudera Manager Agent's log directory has sufficient free space. This test can be configured using the Cloudera Manager Agent Log Directory Free Space Monitoring Absolute and Cloudera Manager Agent Log Directory Free Space Monitoring Percentage host monitoring settings. Short Name: Agent Log Directory 42 Cloudera Manager 4.6 Health Checks

53 Host Agent Parcel Directory Free Space Cloudera Manager Agent Log Directory Free Space Monitoring Absolute Cloudera Manager Agent Log Directory Free Space Monitoring Percentage The health check host_agent_log_ for monitoring directory_free_space_ of free space on the absolute_ filesystem that contains the Cloudera Manager agent's log directory. The health check host_agent_log_ for monitoring directory_free_space_ of free space on the percentage_ filesystem that contains the Cloudera Manager agent's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Cloudera Manager Agent Log Directory Free Space Monitoring Absolute setting is configured. critical: , warning: critical:never, warning:never BYTES Host Agent Parcel Directory Free Space Details: This is a host health check that checks that the filesystem containing the Cloudera Manager Agent's parcel directory has sufficient free space. This test can be configured using the Cloudera Manager Agent Parcel Directory Free Space Monitoring Absolute and Cloudera Manager Agent Parcel Directory Free Space Monitoring Percentage host monitoring settings. Short Name: Agent Parcel Directory Cloudera Manager Agent Parcel Directory Free Space Monitoring Absolute The health check host_agent_parcel_ for monitoring directory_free_space_ of free space on the absolute_ filesystem that contains the Cloudera Manager critical: , warning: BYTES Cloudera Manager 4.6 Health Checks 43

54 Host Agent Process Directory Free Space agent's parcel directory. Cloudera Manager Agent Parcel Directory Free Space Monitoring Percentage The health check host_agent_parcel_ for monitoring directory_free_space_ of free space on the percentage_ filesystem that contains the Cloudera Manager agent's parcel directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Cloudera Manager Agent Parcel Directory Free Space Monitoring Absolute setting is configured. critical:never, warning:never Host Agent Process Directory Free Space Details: This is a host health check that checks that the filesystem containing the Cloudera Manager Agent's process directory has sufficient free space. The process directory contains the configuration files for the processes which the Cloudera Manager Agent starts. This test can be configured using the Cloudera Manager Agent Process Directory Free Space Monitoring Absolute and Cloudera Manager Agent Process Directory Free Space Monitoring Percentage host monitoring settings. Short Name: Agent Process Directory Cloudera Manager Agent Process Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains the Cloudera Manager agent's process directory. host_agent_process_ directory_free_space_ absolute_ critical: , warning: BYTES Cloudera Manager Agent Process The health check host_agent_process_ for monitoring directory_free_space_ critical:never, 44 Cloudera Manager 4.6 Health Checks

55 Host Clock Offset Directory Free Space Monitoring Percentage of free space on the filesystem that contains the Cloudera Manager agent's process directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Cloudera Manager Agent Process Directory Free Space Monitoring Absolute setting is configured. percentage_ warning:never Host Clock Offset Details: This is a host health check that checks if the host's system clock appears to be out-of-sync. A failure of this health check may indicate that the host's system clock needs to be synchronized with other cluster nodes. It is recommended that NTP be used to synchronize system clocks across the cluster. Check the Cloudera Manager agent log for more details. The clock offset metric is calculated by comparing system clock values after accounting for measured network latency. In normal operating conditions the estimated offset will be non-zero and may vary slightly due to measurement error. This test can be configured using the Host Clock Offset host configuration setting. Short Name: Clock Offset Host Clock Offset The health check for the host clock offset host_clock_offset_ critical: , warning: MILLISECONDS Host DNS Resolution Details: This is a host health check that checks that the host's hostname and canonical name are consistent when checked from a Java process. A failure of this health check may indicate that the host's DNS configuration is not correct. Check the Cloudera Manager Agent log for the names that were detected by this check. The hostname and canonical name are considered to be consistent if the hostname or the hostname plus a domain name is the same as the canonical name. This health check Cloudera Manager 4.6 Health Checks 45

56 Host DNS Resolution Duration uses domain names from the domain and search lines in /etc/resolv.conf. This health check does not consult /etc/nsswitch.conf and may give incorrect results if /etc/resolv.conf is not used by the host. There may be a delay of up to 5 minutes before this health check picks up changes to /etc/resolv.conf. This test can be configured using the Hostname and Canonical Name Health Check host configuration setting. Short Name: DNS Resolution Hostname and Canonical Name Health Check Whether the hostname and host_dns_resolution canonical names for this _enabled host are consistent when checked from a Java process. Host DNS Resolution Duration Details: This is a host health check that checks that the host's DNS resolution completes in a timely manner. The DNS resolution duration is calculated by measuring the time that a call to getlocalhost in a Java process takes on this host. Please note that DNS information may be cached on the host and this caching may affect the reported resolution duration. A failure of this health check may indicate that the host's DNS configuration is set incorrectly or the hosts's DNS server is responding slowly. This test can be configured using the Host DNS Resolution Duration host configuration setting. Short Name: DNS Resolution Duration Host DNS The health check host_dns_resolution_ Resolution Duration for the duration_ host DNS resolution duration critical:never, warning: MILLISECONDS Host Memory Swapping Details: This is a health check that checks that the host has not swapped out more than a certain number of pages over the last fifteen minutes. A failure of this health check may indicate misconfiguration of the host operating system, or too many processes running on the host. Try reducing vm.swappiness, or add more memory to the host. This test can be configured using the Host Memory Swapping, Host Memory Swapping Check Window host configuration settings. Short Name: Swapping 46 Cloudera Manager 4.6 Health Checks

57 Host Network Frame Errors Host Memory Swapping Check Window The amount of time over which the memory swapping test checks for pages swapped. host_memswap_window 15 MINUTES Host Memory Swapping The health check host_memswap_ of the number of pages swapped out on the host in the last 15 minutes critical:never, warning:any PAGES Host Network Frame Errors Details: This is a host health check that checks for network frame errors across all network interfaces. A failure of this health check may indicate a problem with network hardware (e.g. switches) and can potentially cause other service or role-level performance problems. Check the host and network hardware logs for more details. This test can be configured using the Host Network Frame Error Percentage, Host Network Frame Error Check Window, Host Network Frame Error Test Minimum Required Packets host configuration settings. Short Name: Frame Errors Host Network Frame Error Check Window The amount of time over which the host frame error checks for frame errors. host_network_frame_ errors_window 15 MINUTES Host Network Frame Error Percentage The health check for the percentage of received packets that are frame errors. host_network_frame_ errors_ critical: , warning:any Host Network Frame Error Test Minimum Required Packets The minimum number host_network_frame_ of received packets that errors_floor must be received within the test window 0 Cloudera Manager 4.6 Health Checks 47

58 Host Network Interfaces Slow Mode for this test to return "Bad" health. If less that this number of packets is received during the test window, the health check will never return "Bad" health. Host Network Interfaces Slow Mode Details: This is a host health check that checks for network interfaces that appear to be operating at less than full speed. A failure of this health check may indicate that network interface(s) may be configured incorrectly and may be causing performance problems. Use the ethtool command to check and configure the host's network interfaces to use the fastest available link speed and duplex mode. This test can be configured using the Host's Network Interfaces Slow Link Modes, Network Interface Expected Link Speed and Network Interface Expected Duplex Mode host configuration settings. Short Name: Network Interface Speed Host's Network Interfaces Slow Link Modes The for the health check of the number of network interfaces that appear to be operating at less than full speed. host_network_ interfaces_slow_mode _ critical:never, warning:any Network Interface Expected Duplex Mode The expected duplex mode for network interfaces host_nic_expected_ duplex_mode Full Network Interface Expected Link Speed The expected network interface link speed host_nic_expected_ speed Cloudera Manager 4.6 Health Checks

59 Host Cloudera Manager Agent Health Host Cloudera Manager Agent Health Details: This is a host health check that checks that the host's Cloudera Manager Agent is heart beating correctly and has the correct software version. A failure of this health check may indicate a lack of connectivity with the host's Cloudera Manager Agent, a problem with the Cloudera Manager Agent, or that the Cloudera Manager Agent or Host Monitor software is out of date. Check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the host, or look in the host's Cloudera Manager Agent logs for more details. If this check reports a software version mismatch between the Cloudera Manager Agent and the Host Monitor, check the version of each component by consulting the appropriate logs or the appropriate status web pages. This test can be enabled or disabled using the Host Process Health Check host configuration setting. Short Name: Agent Status Host Process Health Check Enables the health check that the host's process state is consistent with the role configuration host_scm_health_ enabled Host Monitor File Descriptor Details: This Host Monitor health check checks that the number of file descriptors used does not rise above some percentage of the Host Monitor file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Host Monitor monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. hostmonitor_fd_ critical: , warning: Host Monitor Host Health Details: This Host Monitor health check factors in the health of the host upon which the Host Monitor is running. A failure of this check means that the host running the Host Monitor is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Host Cloudera Manager 4.6 Health Checks 49

60 Host Monitor Host Pipeline Monitor Host Health Check Host Monitor monitoring setting. Short Name: Host Health Host Monitor Host Health Check When computing the overall Host Monitor health, consider the host's health. hostmonitor_host_health _enabled Host Monitor Host Pipeline Details: This Host Monitor health check checks that no messages are being dropped by the host stage of the Host Monitor pipeline. A failure of this health check indicates a problem with the Host Monitor. This may indicate a configuration problem or a bug in the Host Monitor. This test can be configured using the Host Monitor Host Pipeline Monitoring Time Period monitoring setting. Short Name: Host Pipeline Host Monitor Host Pipeline Monitoring The health check for monitoring the Host Monitor host pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period. hostmonitor_host_ pipeline_ critical:any, warning:never Host Monitor Host Pipeline Monitoring Time Period The time period over which the Host Monitor host pipeline will be monitored for dropped messages. hostmonitor_host_ pipeline_window 5 MINUTES Host Monitor Log Directory Free Space Details: This Host Monitor health check checks that the filesystem containing the log directory of this Host Monitor has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Host Monitor monitoring settings. Short Name: Log Directory Free Space 50 Cloudera Manager 4.6 Health Checks

61 Host Monitor Cloudera Manager Agent Health Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never Host Monitor Cloudera Manager Agent Health Details: This Host Monitor health check checks that the Cloudera Manager Agent on the Host Monitor host is heart beating correctly and that the process associated with the Host Monitor role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Host Monitor process, a lack of connectivity to the Cloudera Manager Agent on the Host Monitor host, or a problem with the Cloudera Manager Agent. This check can fail either because the Host Monitor has crashed or because the Host Monitor will not start or stop in a timely fashion. Check the Host Monitor logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Host Monitor host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Host Monitor host, or look in the Cloudera Manager Agent logs on the Host Monitor host for more details. This test can be enabled or disabled using the Host Monitor Process Health Check Host Monitor monitoring setting. Short Name: Process Status Host Monitor Process Health Enables the health check that the Host Monitor's hostmonitor_scm_ health_enabled Cloudera Manager 4.6 Health Checks 51

62 Host Monitor Unexpected Exits Check process state is consistent with the role configuration Host Monitor Unexpected Exits Details: This Host Monitor health check checks that the Host Monitor has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Host Monitor monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never Host Monitor Web Metric Collection Details: This Host Monitor health check checks that the web server of the Host Monitor is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the Host Monitor, a misconfiguration of the Host Monitor or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the Host Monitor for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the Host Monitor's web server are failing or timing out. These requests are completely local to the Host Monitor's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the Host Monitor's web server responded 52 Cloudera Manager 4.6 Health Checks

63 HttpFS File Descriptor to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Host Monitor monitoring setting. Short Name: Web Server Status Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. hostmonitor_web_ metric_collection_ enabled HttpFS File Descriptor Details: This HttpFS health check checks that the number of file descriptors used does not rise above some percentage of the HttpFS file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring HttpFS monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. httpfs_fd_ critical: , warning: HttpFS Host Health Details: This HttpFS health check factors in the health of the host upon which the HttpFS is running. A failure of this check means that the host running the HttpFS is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the HttpFS Host Health Check HttpFS monitoring setting. Short Name: Host Health Cloudera Manager 4.6 Health Checks 53

64 HttpFS Log Directory Free Space HttpFS Host Health Check When computing the overall HttpFS health, consider the host's health. httpfs_host_health_ enabled HttpFS Log Directory Free Space Details: This HttpFS health check checks that the filesystem containing the log directory of this HttpFS has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage HttpFS monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: BYTES , warning: Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never HttpFS Cloudera Manager Agent Health Details: This HttpFS health check checks that the Cloudera Manager Agent on the HttpFS host is heart beating correctly and that the process associated with the HttpFS role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the HttpFS process, a lack 54 Cloudera Manager 4.6 Health Checks

65 HttpFS Unexpected Exits of connectivity to the Cloudera Manager Agent on the HttpFS host, or a problem with the Cloudera Manager Agent. This check can fail either because the HttpFS has crashed or because the HttpFS will not start or stop in a timely fashion. Check the HttpFS logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the HttpFS host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the HttpFS host, or look in the Cloudera Manager Agent logs on the HttpFS host for more details. This test can be enabled or disabled using the HttpFS Process Health Check HttpFS monitoring setting. Short Name: Process Status HttpFS Process Health Check Enables the health check that the HttpFS's process state is consistent with the role configuration httpfs_scm_health_ enabled HttpFS Unexpected Exits Details: This HttpFS health check checks that the HttpFS has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period HttpFS monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when unexpected_exits_ computing unexpected exits. window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never Cloudera Manager 4.6 Health Checks 55

66 Impala Assignment Locality Impala Assignment Locality Details: This is an Impala service-level health check that checks that a sufficient percentage of recent assignments are operating on local data. The check returns "Concerning" or "Bad" health if the number of assignments operating on local data falls below a warning or critical. The number of assignments observed during the check's window must be above a configured minimum number of assignments for the check to be enabled. A failure of this health check may indicate problems with the configuration of the Impala service. Check that each Impala Daemon is co-located with a DataNode, and that the IP address of each Impala Daemon matches the IP address of its co-located DataNode. This test can be configured using the Assignment Locality Ratio, Assignment Locality Monitoring Period and Assignment Locality Minimum Assignments Impala service-wide monitoring settings. Short Name: Assignment Locality Assignment Locality Minimum Assignments The minimum number of assignments that must occur during the test time period before the threshold values will be checked. Until this number of assignments have been observed in the test time period the health check will be disabled. impala_assignment _locality_minimum 10 Assignment Locality Monitoring Period The time period over which to compute the assignment locality ratio. Specified in minutes. impala_assignment _locality_window 15 MINUTES Assignment Locality Ratio The health check for assignment locality. Specified as a percentage of total assignments. impala_assignment critical: , _locality_ warning: Impala Daemons Health Details: This is an Impala service-level health check that checks that enough of the Impala Daemons in the cluster are healthy. The check returns "Concerning" health if the number of healthy Impala Daemons falls below a warning threshold, expressed as a percentage of the total number of Impala Daemons. The check returns "Bad" health if the number of healthy and "Concerning" Impala Daemons falls below a 56 Cloudera Manager 4.6 Health Checks

67 Impala StateStore Health critical threshold, expressed as a percentage of the total number of Impala Daemons. For example, if this check is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 Impala Daemons, this check would return "Good" health if 95 or more Impala Daemons have good health. This check would return "Concerning" health if at least 90 Impala Daemons have either "Good" or "Concerning" health. If more than 10 Impala Daemons have bad health, this check would return "Bad" health. A failure of this health check indicates unhealthy Impala Daemons. Check the status of the individual Impala Daemons for more information. This test can be configured using the Healthy Impala Daemon Monitoring Impala service-wide monitoring setting. Short Name: Impala Daemons Health Healthy Impala Daemon Monitoring The health check of the overall Impala Daemons health. The check returns "Concerning" health if the percentage of "Healthy" Impala Daemons falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" Impala Daemons falls below the critical threshold. impala_impalads_ critical: , healthy_ warning: Impala StateStore Health Details: This Impala service-level health check checks for the presence of a running, healthy Impala StateStore Daemon. The check returns "Bad" health if the service is running and the Impala StateStore Daemon is not running. In all other cases it returns the health of the Impala StateStore Daemon. A failure of this health check indicates a stopped or unhealthy Impala StateStore Daemon. Check the status of the Impala StateStore Daemon for more information. This test can be enabled or disabled using the StateStore Role Health Check Impala StateStore Daemon service-wide monitoring setting. Short Name: StateStore Health StateStore Role Health Check When computing the overall Impala cluster health, consider the impala_statestore_ health_enabled Cloudera Manager 4.6 Health Checks 57

68 Impalad Connectivity StateStore's health Impalad Connectivity Details: This is an Impala Daemon health check that checks that the StateStore considers the Impala Daemon alive. A failure of this health check may indicate that the Impala Daemon is having trouble communicating with the StateStore. Look in the Impala Daemon logs for more details. This check may return an unknown result if the Service Monitor is not able to communicate with the StateStore web server. Check the status of the StateStore web server and the Service Monitor logs if this check is returning an unknown result. This test can be enabled or disabled using the Impala Daemon Connectivity Health Check Impala Daemon monitoring setting. The Impala Daemon Connectivity Tolerance at Startup Impala Daemon monitoring setting and the Health Check Startup Tolerance StateStore monitoring setting can be used to control the check's tolerance windows around Impala Daemon and StateStore restarts respectively. Short Name: StateStore Connectivity Health Check Startup Tolerance The amount of time allowed statestore_startup_ after this role is started that tolerance failures of health checks that rely on communication with this role will be tolerated. 5 MINUTES Impala Daemon Connectivity Health Check Enables the health check that verifies the Impala Daemon is connected to the StateStore impalad_connectivity _health_enabled Impala Daemon Connectivity Tolerance at Startup The amount of time to wait for the Impala Daemon to fully start up and connect to the StateStore before enforcing the connectivity check. impalad_connectivity _tolerance 180 SECONDS 58 Cloudera Manager 4.6 Health Checks

69 Impalad File Descriptor Impalad File Descriptor Details: This Impala Daemon health check checks that the number of file descriptors used does not rise above some percentage of the Impala Daemon file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Impala Daemon monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. impalad_fd_ critical: , warning: Impalad Host Health Details: This Impala Daemon health check factors in the health of the host upon which the Impala Daemon is running. A failure of this check means that the host running the Impala Daemon is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Impala Daemon Host Health Check Impala Daemon monitoring setting. Short Name: Host Health Impala Daemon Host Health Check When computing the overall Impala Daemon health, consider the host's health. impalad_host_health_ enabled Impalad Log Directory Free Space Details: This Impala Daemon health check checks that the filesystem containing the log directory of this Impala Daemon has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Impala Daemon monitoring settings. Short Name: Log Directory Free Space Cloudera Manager 4.6 Health Checks 59

70 Impalad Memory Resident Set Size Health Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never Impalad Memory Resident Set Size Health Details: This Impala Daemon health check checks that the size of the resident set does not rise above a configured threshold value. A failure of this health check may indicates that the Impala Daemon process is consuming more memory than expected. It is possible that that unexpected memory consumption may lead to swapping and decreased performance for processes running on the same host as this Impala Daemon. Increased Impala Daemon memory consumption may be caused by an increased workload on the Impala service, or by a bug in the Impala Daemon software. To avoid failures of this health check, free up additional memory for this Impala Daemon process and increase the Resident Set Size monitoring setting. This test can be configured using the Resident Set Size Impala Daemon monitoring setting. Short Name: Resident Set Size Resident Set Size The health check on the resident size of the process. process_resident_set _size_ critical:never, warning:never BYTES 60 Cloudera Manager 4.6 Health Checks

71 Impalad Cloudera Manager Agent Health Impalad Cloudera Manager Agent Health Details: This Impala Daemon health check checks that the Cloudera Manager Agent on the Impala Daemon host is heart beating correctly and that the process associated with the Impala Daemon role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Impala Daemon process, a lack of connectivity to the Cloudera Manager Agent on the Impala Daemon host, or a problem with the Cloudera Manager Agent. This check can fail either because the Impala Daemon has crashed or because the Impala Daemon will not start or stop in a timely fashion. Check the Impala Daemon logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Impala Daemon host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Impala Daemon host, or look in the Cloudera Manager Agent logs on the Impala Daemon host for more details. This test can be enabled or disabled using the Impala Daemon Process Health Check Impala Daemon monitoring setting. Short Name: Process Status Impala Daemon Process Health Check Enables the health check that the Impala Daemon's process state is consistent with the role configuration impalad_scm_health _enabled Impalad Unexpected Exits Details: This Impala Daemon health check checks that the Impala Daemon has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Impala Daemon monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a unexpected_exits_ critical:any, warning:never Cloudera Manager 4.6 Health Checks 61

72 Impalad Web Metric Collection recent period specified by the unexpected_exits_window configuration for the role. Impalad Web Metric Collection Details: This Impala Daemon health check checks that the web server of the Impala Daemon is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the Impala Daemon, a misconfiguration of the Impala Daemon or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the Impala Daemon for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the Impala Daemon's web server are failing or timing out. These requests are completely local to the Impala Daemon's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the Impala Daemon's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Impala Daemon monitoring setting. Short Name: Web Server Status Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. impalad_web_metric_ collection_enabled JobTracker File Descriptor Details: This JobTracker health check checks that the number of file descriptors used does not rise above some percentage of the JobTracker file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring JobTracker monitoring setting. Short Name: File Descriptors 62 Cloudera Manager 4.6 Health Checks

73 JobTracker Garbage Collection Duration File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. jobtracker_fd_ critical: , warning: JobTracker Garbage Collection Duration Details: This JobTracker health check checks that the JobTracker is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or misconfiguration of the JobTracker. This test can be configured using the JobTracker Garbage Collection Duration and JobTracker Garbage Collection Duration Monitoring Period JobTracker monitoring settings. Short Name: GC Duration JobTracker Garbage Collection Duration Monitoring Period The period to review when computing the moving average of garbage collection time. jobtracker_gc_ duration_window 5 MINUTES JobTracker Garbage Collection Duration The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See JobTracker Garbage Collection Duration Monitoring Period. jobtracker_gc_ duration_ critical: , warning: JobTracker Host Health Details: This JobTracker health check factors in the health of the host upon which the JobTracker is running. A failure of this check means that the host running the JobTracker is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Cloudera Manager 4.6 Health Checks 63

74 JobTracker Log Directory Free Space JobTracker Host Health Check JobTracker monitoring setting. Short Name: Host Health JobTracker Host Health Check When computing the overall JobTracker health, consider the host's health. jobtracker_host_ health_enabled JobTracker Log Directory Free Space Details: This JobTracker health check checks that the filesystem containing the log directory of this JobTracker has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage JobTracker monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free _space_absolute_ critical: BYTES 00000, warning: Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free _space_ percentage_ critical:never, warning:never 64 Cloudera Manager 4.6 Health Checks

75 JobTracker Cloudera Manager Agent Health JobTracker Cloudera Manager Agent Health Details: This JobTracker health check checks that the Cloudera Manager Agent on the JobTracker host is heart beating correctly and that the process associated with the JobTracker role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the JobTracker process, a lack of connectivity to the Cloudera Manager Agent on the JobTracker host, or a problem with the Cloudera Manager Agent. This check can fail either because the JobTracker has crashed or because the JobTracker will not start or stop in a timely fashion. Check the JobTracker logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the JobTracker host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the JobTracker host, or look in the Cloudera Manager Agent logs on the JobTracker host for more details. This test can be enabled or disabled using the JobTracker Process Health Check JobTracker monitoring setting. Short Name: Process Status JobTracker Process Health Check Enables the health check that the JobTracker's process state is consistent with the role configuration jobtracker_scm_ health_enabled JobTracker Unexpected Exits Details: This JobTracker health check checks that the JobTracker has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period JobTracker monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a unexpected_exits_ critical:any, warning:never Cloudera Manager 4.6 Health Checks 65

76 JobTracker Web Metric Collection recent period specified by the unexpected_exits_window configuration for the role. JobTracker Web Metric Collection Details: This JobTracker health check checks that the web server of the JobTracker is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the JobTracker, a misconfiguration of the JobTracker or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the JobTracker for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the JobTracker's web server are failing or timing out. These requests are completely local to the JobTracker's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the JobTracker's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection JobTracker monitoring setting. Short Name: Web Server Status Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. jobtracker_web_ metric_collection_ enabled JournalNode Edits Directory Free Space Details: This is a JournalNode health check that checks that the filesystem containing the edits directory of this JournalNode has sufficient free space. This test can be configured using the Edits Directory Free Space Monitoring Absolute and Edits Directory Free Space Monitoring Percentage JournalNode monitoring settings. Short Name: Edits Directory Free Space 66 Cloudera Manager 4.6 Health Checks

77 JournalNode File Descriptor Edits Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains the JournalNode's edits directory. journalnode_edits_ directory_free_ space_absolute_ critical: , warning: BYTES Edits Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains the JournalNode's edits directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Edits Directory Free Space Monitoring Absolute setting is configured. journalnode_edits_ directory_free_ space_percentage_ critical:never, warning:never JournalNode File Descriptor Details: This JournalNode health check checks that the number of file descriptors used does not rise above some percentage of the JournalNode file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring JournalNode monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. journalnode_fd_ critical: , warning: Cloudera Manager 4.6 Health Checks 67

78 JournalNode Garbage Collection Duration JournalNode Garbage Collection Duration Details: This JournalNode health check checks that the JournalNode is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or misconfiguration of the JournalNode. This test can be configured using the JournalNode Garbage Collection Duration and JournalNode Garbage Collection Duration Monitoring Period JournalNode monitoring settings. Short Name: GC Duration JournalNode Garbage Collection Duration Monitoring Period The period to review when computing the moving average of garbage collection time. journalnode_gc_ duration_window 5 MINUTES JournalNode Garbage Collection Duration The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See JournalNode Garbage Collection Duration Monitoring Period. journalnode_gc_ duration_ critical: , warning: JournalNode Host Health Details: This JournalNode health check factors in the health of the host upon which the JournalNode is running. A failure of this check means that the host running the JournalNode is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the JournalNode Host Health Check JournalNode monitoring setting. Short Name: Host Health JournalNode Host Health Check When computing the overall JournalNode health, consider the host's health. journalnode_host_ health_enabled 68 Cloudera Manager 4.6 Health Checks

79 JournalNode Log Directory Free Space JournalNode Log Directory Free Space Details: This JournalNode health check checks that the filesystem containing the log directory of this JournalNode has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage JournalNode monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never JournalNode Cloudera Manager Agent Health Details: This JournalNode health check checks that the Cloudera Manager Agent on the JournalNode host is heart beating correctly and that the process associated with the JournalNode role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the JournalNode process, a lack of connectivity to the Cloudera Manager Agent on the JournalNode host, or a problem with the Cloudera Manager Agent. This check can fail either because the JournalNode has crashed or because the JournalNode will not start or stop in a timely fashion. Check the JournalNode logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the JournalNode host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the JournalNode host, or look in the Cloudera Manager Agent logs on the JournalNode host for more details. This test can be enabled or disabled using the Cloudera Manager 4.6 Health Checks 69

80 JournalNode Sync Status JournalNode Process Health Check JournalNode monitoring setting. Short Name: Process Status JournalNode Process Health Check Enables the health check that the JournalNode's process state is consistent with the role configuration journalnode_scm_ health_enabled JournalNode Sync Status Details: This is a JournalNode health check that checks that the active NameNode is in sync with this JournalNode. This check returns "Bad" health if the active NameNode is out-of-sync with the JournalNode. This check is disabled when there is no active NameNode. This test can be configured using the Active NameNode Sync Status Health Check and Active NameNode Sync Status Startup Tolerance JournalNode monitoring settings. Short Name: Sync Status Active NameNode Sync Status Health Check Enables the health check that verifies the active NameNode's sync status to the JournalNode journalnode_sync_status_ enabled Active NameNode Sync Status Startup Tolerance The amount of time at JournalNode startup allowed for the active NameNode to get in sync with the JournalNode. journalnode_sync_status_ startup_tolerance 180 SECONDS JournalNode Unexpected Exits Details: This JournalNode health check checks that the JournalNode has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected 70 Cloudera Manager 4.6 Health Checks

81 JournalNode Web Metric Collection exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period JournalNode monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never JournalNode Web Metric Collection Details: This JournalNode health check checks that the web server of the JournalNode is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the JournalNode, a misconfiguration of the JournalNode or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the JournalNode for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the JournalNode's web server are failing or timing out. These requests are completely local to the JournalNode's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the JournalNode's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection JournalNode monitoring setting. Short Name: Web Server Status Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics journalnode_web_ metric_collection_ enabled Cloudera Manager 4.6 Health Checks 71

82 MapReduce High Availability JobTracker Health from the web server. MapReduce High Availability JobTracker Health Details: This is an MapReduce service-level health check that checks for the presence of an active, running and healthy JobTracker. The check returns "Bad" health if the service is running and a running, active JobTracker cannot be found. In all other cases it returns the health of the running, active JobTracker. A failure of this health check may indicate stopped or unhealthy JobTracker roles, the need to issue a failover command to make some JobTracker active, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and one or more JobTrackers. Check the status of the MapReduce service's JobTracker roles and look in the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the JobTracker Role Health Check MapReduce service-wide monitoring setting. In addition, the Active JobTracker Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active MapReduce JobTracker before this health check fails, and the JobTracker Activation Startup Tolerance can be used to adjust the amount of time around JobTracker startup that the check allows for a JobTracker to be made active. Short Name: Active JobTracker Health Active JobTracker Detection Window The tolerance window that will be used in MapReduce service tests that depend on detection of the active JobTracker. mapreduce_active_ jobtracker_ detecton_window 3 MINUTES JobTracker Activation Startup Tolerance The amount of time after JobTracker(s) start that the lack of an active JobTracker will be tolerated. This is intended to allow either the auto-failover daemon to make a JobTracker active, or a specifically issued failover command to take effect. mapreduce_jobtrac ker_activation_ startup_tolerance 180 SECONDS JobTracker Role When computing the overall MapReduce cluster health, mapreduce_ jobtracker_health_ 72 Cloudera Manager 4.6 Health Checks

83 MapReduce Job Failure Ratio Health Check consider the JobTracker's health enabled MapReduce Job Failure Ratio Details: This is a MapReduce service-level health check that checks that no more than some percentage of recently completed jobs have failed. A failure of this health check may indicate problems with the MapReduce service or with the failing jobs. Check the status of the MapReduce service for more details. This test can be configured using the Job Failure Ratio, Job Failure Ratio Minimum Failing Jobs and Job Failure Ratio Monitoring Period MapReduce service-wide monitoring setting. Short Name: Job Failure Ratio Job Failure Ratio Minimum Failing Jobs The minimum number of mapreduce_job_failure 0 jobs that must fail during the test time period before the threshold values will be checked. Until this number of jobs have failed in the test time period the health check will continue to return good health. _ratio_minimum_jobs Job Failure Ratio Monitoring Period The time period to review when computing job failure ratio. Specified in minutes. mapreduce_job_failure _ratio_window 5 MINUTES Job Failure Ratio The health check of the number of recently failed jobs. Specified as a percentage of recently completed jobs. See Job Failure Ratio Monitoring Period. mapreduce_job_failure _ratio_ critical:never, warning:never Cloudera Manager 4.6 Health Checks 73

84 MapReduce JobTracker Health MapReduce JobTracker Health Details: This MapReduce service-level health check checks for the presence of a running, healthy JobTracker. The check returns "Bad" health if the service is running and the JobTracker is not running. In all other cases it returns the health of the JobTracker. A failure of this health check indicates a stopped or unhealthy JobTracker. Check the status of the JobTracker for more information. This test can be enabled or disabled using the JobTracker Role Health Check JobTracker service-wide monitoring setting. Short Name: JobTracker Health JobTracker Role Health Check When computing the overall MapReduce cluster health, consider the JobTracker's health mapreduce_jobtracker _health_enabled MapReduce Maps Locality Details: This is a MapReduce service-level health check that checks that no more than some percentage of recently completed maps were operating on rack-local or other-local data. The check returns "Concerning" health if the number of rack-local maps is above a configured minimum number of maps and greater than the warning threshold or if the number of other-local maps is above a configured minimum number of maps and greater than the warning threshold. The test never returns "Bad" health. A failure of this health check may indicate problems with the configuration of the MapReduce service. Consider using the fair-scheduler and changing its delay configuration mapred.fairscheduler.locality.delay. In some scenarios, it may be normal to have a large number of non-local maps. For example, data import maps are always non-local. In such scenarios, consider disabling one or more of the used by this test. This test can be configured using the Rack- Local Map Task, Maps Locality Minimum Rack-Local Maps, Other-Local Map Task, Maps Locality Minimum Other-Local Maps and Map Tasks Locality Monitoring Period MapReduce service-wide monitoring settings. Short Name: Map Task Locality Map Tasks Locality Monitoring Period The time period to monitor when computing health test results for map tasks locality. Specified in minutes. mapreduce_maps_ locality_window 15 MINUTES 74 Cloudera Manager 4.6 Health Checks

85 MapReduce Maps Locality Maps Locality Minimum Other- Local Maps The minimum number of non-local maps that must complete during the test time period before the threshold values will be checked. Until this number of non-local maps have completed in the test time period the health check will continue to return good health. mapreduce_maps_ 0 locality_minimum_ other_locality_ maps Maps Locality Minimum Rack- Local Maps The minimum number of rack-local maps that must complete during the test time period before the threshold values will be checked. Until this number of rack-local maps have completed in the test time period the health check will continue to return good health. mapreduce_maps_ 0 locality_minimum_ rack_local_maps Other-Local Map Task The health check of the number of map tasks using non-local data. Specified as a percentage of other-local map tasks in the total number of map tasks. mapreduce_other_ local_ critical:never, warning:never Rack-Local Map Task The health check of the number of map tasks using non-local data. Specified as a percentage of rack-local map tasks in the total number of map tasks. mapreduce_rack_ local_ critical:never, warning:never Cloudera Manager 4.6 Health Checks 75

86 MapReduce Map Backlog MapReduce Map Backlog Details: This is a MapReduce service-level health check that checks that the number of waiting map tasks in the cluster does not rise above some percentage of the number of total available map slots. The behavior of this health check depends on how the MapReduce cluster is being used. In some scenarios, it may be normal to have large numbers of waiting map tasks. In such scenarios, this test should be disabled. In other scenarios, a failure of this health check may indicate a capacity planning problem or a problem with the MapReduce service. Check the status of the MapReduce service for more details. This test can be configured using the MapReduce Map Task Backlog MapReduce service-wide monitoring setting. Short Name: Map Task Backlog MapReduce Map Task Backlog The health check of the number of map tasks in the backlog. Specified as a percentage of the total number of map slots. mapreduce_map_ backlog_ critical:never, warning:never MapReduce Reduce Backlog Details: This is a MapReduce service-level health check that checks that the number of waiting reduce tasks in the cluster does not rise above some percentage of the number of total available reduce slots. The behavior of this health check depends on how the MapReduce cluster is being used. In some scenarios, it may be normal to have large numbers of waiting reduce tasks. In such scenarios, this test should be disabled. In other scenarios, a failure of this health check may indicate a capacity planning problem or a problem with the MapReduce service. Check the status of the MapReduce service for more details. This test can be configured using the MapReduce Reduce Task Backlog MapReduce service-wide monitoring setting. Short Name: Reduce Task Backlog MapReduce Reduce Task Backlog The health check for the number of reduce tasks in the backlog. Specified as a percentage of the total number of reduce slots. mapreduce_reduce_ backlog_ critical:never, warning:never 76 Cloudera Manager 4.6 Health Checks

87 MapReduce Standby Jobtrackers Health MapReduce Standby Jobtrackers Health Details: This is an MapReduce service-level health check that checks for a running, healthy JobTracker in standby mode. The check is disabled if the MapReduce service is not configured with multiple JobTrackers. Otherwise, the check returns "Concerning" health if either of two conditions are met. First, if there is no JobTracker running in standby mode. Second, if the running standby JobTracker is in less than "Good" health. A failure of this health check may indicate one or more stopped or unhealthy JobTrackers, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and some or all of the MapReduce JobTrackers. Check the status of the MapReduce service's JobTracker roles and the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Standby JobTracker Health Check MapReduce service-wide monitoring setting. In addition, the Active JobTracker Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active JobTracker before this health check fails. Short Name: Standby JobTracker Health Active JobTracker Detection Window The tolerance window that will be used in MapReduce service tests that depend on detection of the active JobTracker. mapreduce_active_ jobtracker_detecton _window 3 MINUTES Standby JobTracker Health Check When computing the overall cluster health, consider the health of the standby JobTracker. mapreduce_standby _jobtrackers_health_ enabled MapReduce TaskTrackers Health Details: This is a MapReduce service-level health check that checks that enough of the TaskTrackers in the cluster are healthy. The check returns "Concerning" health if the number of healthy TaskTrackers falls below a warning threshold, expressed as a percentage of the total number of TaskTrackers. The check returns "Bad" health if the number of healthy and "Concerning" TaskTrackers falls below a critical threshold, expressed as a percentage of the total number of TaskTrackers. For example, if this check is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 TaskTrackers, this check would return "Good" health if 95 or more TaskTrackers have good health. This check would return "Concerning" health if at least 90 TaskTrackers have either "Good" or "Concerning" health. If more than 10 TaskTrackers have bad health, this check would return "Bad" health. A failure of this health check indicates unhealthy TaskTrackers. Check the status of the individual TaskTrackers for Cloudera Manager 4.6 Health Checks 77

88 Master Canary Health more information. This test can be configured using the Healthy TaskTracker Monitoring MapReduce service-wide monitoring setting. Short Name: TaskTrackers Health Healthy TaskTracker Monitoring The health check of the overall TaskTrackers health. The check returns "Concerning" health if the percentage of "Healthy" TaskTrackers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" TaskTrackers falls below the critical threshold. mapreduce_ critical: , tasktrackers_ warning: healthy_ Master Canary Health Details: This is an HBase Master health check that checks that a client can connect to and get basic information from the Master in a reasonable amount of time. The check returns "Bad" health if the connection to or basic queries of the Master fail. The check returns "Concerning" health if the connection attempt or queries do not complete in a reasonable time. A failure of this health check may indicate that the Master is failing to satisfy basic client requests correctly or in a timely fashion. Check the status of the Master, and look in the Master logs for more details. This test can be enabled or disabled using the HBase Master Canary Health Check HBase Master monitoring setting. Short Name: HBase Master Canary HBase Master Canary Health Check Enables the health check that a client can connect to the HBase Master master_canary_ health_enabled Master File Descriptor Details: This Master health check checks that the number of file descriptors used does not rise above some percentage of the Master file descriptor limit. A failure of this health check may indicate a bug in 78 Cloudera Manager 4.6 Health Checks

89 Master Garbage Collection Duration either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Master monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. master_fd_ critical: , warning: Master Garbage Collection Duration Details: This Master health check checks that the Master is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or misconfiguration of the Master. This test can be configured using the HBase Master Garbage Collection Duration and HBase Master Garbage Collection Duration Monitoring Period Master monitoring settings. Short Name: GC Duration HBase Master Garbage Collection Duration Monitoring Period The period to review when computing the moving average of garbage collection time. master_gc_duration _window 5 MINUTES HBase Master Garbage Collection Duration The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See HBase Master Garbage Collection Duration Monitoring Period. master_gc_duration _ critical: , warning: Cloudera Manager 4.6 Health Checks 79

90 Master Host Health Master Host Health Details: This Master health check factors in the health of the host upon which the Master is running. A failure of this check means that the host running the Master is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the HBase Master Host Health Check Master monitoring setting. Short Name: Host Health HBase Master Host Health Check When computing the overall HBase Master health, consider the host's health. master_host_health_ enabled Master Log Directory Free Space Details: This Master health check checks that the filesystem containing the log directory of this Master has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Master monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring log_directory_free_ space_percentage_ critical:never, warning:never 80 Cloudera Manager 4.6 Health Checks

91 Master Cloudera Manager Agent Health Absolute setting is configured. Master Cloudera Manager Agent Health Details: This Master health check checks that the Cloudera Manager Agent on the Master host is heart beating correctly and that the process associated with the Master role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Master process, a lack of connectivity to the Cloudera Manager Agent on the Master host, or a problem with the Cloudera Manager Agent. This check can fail either because the Master has crashed or because the Master will not start or stop in a timely fashion. Check the Master logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Master host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Master host, or look in the Cloudera Manager Agent logs on the Master host for more details. This test can be enabled or disabled using the HBase Master Process Health Check Master monitoring setting. Short Name: Process Status HBase Master Process Health Check Enables the health check that the HBase Master's process state is consistent with the role configuration master_scm_health_ enabled Healthy HBase Region Servers Monitoring The health check hbase_regionservers of the overall HBase _healthy_ RegionServers health. The check returns "Concerning" health if the percentage of "Healthy" HBase RegionServers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" HBase RegionServers falls below the critical threshold. critical: , warning: Cloudera Manager 4.6 Health Checks 81

92 Master Unexpected Exits Master Unexpected Exits Details: This Master health check checks that the Master has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Master monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never Master Web Metric Collection Details: This Master health check checks that the web server of the Master is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the Master, a misconfiguration of the Master or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the Master for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the Master's web server are failing or timing out. These requests are completely local to the Master's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the Master's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Master monitoring setting. Short Name: Web Server Status Web Metric Collection Enables the health check that the Cloudera master_web_metric_ collection_enabled 82 Cloudera Manager 4.6 Health Checks

93 Management Activity Monitor Health Manager Agent can successfully contact and gather metrics from the web server. Management Activity Monitor Health Details: This Cloudera Management Services service-level health check checks for the presence of a running, healthy Activity Monitor. The check returns "Bad" health if the service is running and the Activity Monitor is not running. In all other cases it returns the health of the Activity Monitor. A failure of this health check indicates a stopped or unhealthy Activity Monitor. Check the status of the Activity Monitor for more information. This test can be enabled or disabled using the Activity Monitor Role Health Check Activity Monitor service-wide monitoring setting. Short Name: Activity Monitor Health Activity Monitor Role Health Check When computing the overall Management Service health, consider Activity Monitor's health mgmt_activitymonitor _health_enabled Management Alert Publisher Health Details: This Cloudera Management Services service-level health check checks for the presence of a running, healthy Alert Publisher. The check returns "Bad" health if the service is running and the Alert Publisher is not running. In all other cases it returns the health of the Alert Publisher. A failure of this health check indicates a stopped or unhealthy Alert Publisher. Check the status of the Alert Publisher for more information. This test can be enabled or disabled using the Alert Publisher Role Health Check Alert Publisher service-wide monitoring setting. Short Name: Alert Publisher Health Alert Publisher Role Health Check When computing the overall Management Service health, consider Alert Publisher's health mgmt_alertpublisher_ health_enabled Cloudera Manager 4.6 Health Checks 83

94 Management Event Server Health Management Event Server Health Details: This Cloudera Management Services service-level health check checks for the presence of a running, healthy Event Server. The check returns "Bad" health if the service is running and the Event Server is not running. In all other cases it returns the health of the Event Server. A failure of this health check indicates a stopped or unhealthy Event Server. Check the status of the Event Server for more information. This test can be enabled or disabled using the Event Server Role Health Check Event Server service-wide monitoring setting. Short Name: Event Server Health Event Server Role Health Check When computing the overall Management Service health, consider Event Server's health mgmt_eventserver _health_enabled Management Host Monitor Health Details: This Cloudera Management Services service-level health check checks for the presence of a running, healthy Host Monitor. The check returns "Bad" health if the service is running and the Host Monitor is not running. In all other cases it returns the health of the Host Monitor. A failure of this health check indicates a stopped or unhealthy Host Monitor. Check the status of the Host Monitor for more information. This test can be enabled or disabled using the Host Monitor Role Health Check Host Monitor service-wide monitoring setting. Short Name: Host Monitor Health Host Monitor Role Health Check When computing the overall Management service health, consider the Host Monitor's health mgmt_hostmonitor _health_enabled Management Navigator Health Details: This Cloudera Management Services service-level health check checks for the presence of a running, healthy Navigator Server. The check returns "Bad" health if the service is running and the Navigator Server is not running. In all other cases it returns the health of the Navigator Server. A failure of this health check indicates a stopped or unhealthy Navigator Server. Check the status of the Navigator Server for more information. This test can be enabled or disabled using the Navigator Role Health Check 84 Cloudera Manager 4.6 Health Checks

95 Management Reports Manager Health Navigator Server service-wide monitoring setting. Short Name: Navigator Health Navigator Role Health Check When computing the overall Management Service health, consider Navigator's health mgmt_navigator_ health_enabled Management Reports Manager Health Details: This Cloudera Management Services service-level health check checks for the presence of a running, healthy Reports Manager. The check returns "Bad" health if the service is running and the Reports Manager is not running. In all other cases it returns the health of the Reports Manager. A failure of this health check indicates a stopped or unhealthy Reports Manager. Check the status of the Reports Manager for more information. This test can be enabled or disabled using the Reports Manager Role Health Check Reports Manager service-wide monitoring setting. Short Name: Reports Manager Health Reports Manager Role Health Check When computing the overall Management Service health, consider Reports Manager's health mgmt_reportsmanager _health_enabled Management Service Monitor Health Details: This Cloudera Management Services service-level health check checks for the presence of a running, healthy Service Monitor. The check returns "Bad" health if the service is running and the Service Monitor is not running. In all other cases it returns the health of the Service Monitor. A failure of this health check indicates a stopped or unhealthy Service Monitor. Check the status of the Service Monitor for more information. This test can be enabled or disabled using the Service Monitor Role Health Check Service Monitor service-wide monitoring setting. Short Name: Service Monitor Health Service Monitor When computing the mgmt_servicemonitor Cloudera Manager 4.6 Health Checks 85

96 NameNode Checkpoint Age Role Health Check overall Management Service health, consider Service Monitor's health _health_enabled NameNode Checkpoint Age Details: This is a NameNode health check that checks the NameNode's filesystem checkpoint is no older than some percentage of the Filesystem Checkpoint Period. A failure of this health check may indicate a problem with the NameNode or the SecondaryNameNode. Check the NameNode or the SecondaryNameNode logs for more details. This test can be configured using the Filesystem Checkpoint Age Monitoring NameNode monitoring setting. Short Name: Checkpoint Status Filesystem Checkpoint Age Monitoring The health check namenode_checkpoint of the age of _age_ the HDFS namespace checkpoint. Specified as a percentage of the configured checkpoint interval. critical: , warning: NameNode Data Directories Free Space Details: This is a NameNode health check that checks that the filesystems containing the data directories have sufficient free space. This test can be configured using the Data Directories Free Space Monitoring Percentage and NameNode monitoring settings. Short Name: Data Directories Free Space Data Directories Free Space Monitoring Absolute The health check for monitoring of free space on the filesystems that contain this role's data directories. namenode_data_ directories_free_ space_absolute_ critical: BYTES 00000, warning: Cloudera Manager 4.6 Health Checks

97 NameNode Directory Failures Data Directories Free Space Monitoring Percentage The health check namenode_data_ for directories_free_ monitoring of free space space_percentage_ on the filesystems that contain this role's data directories. Specified as a percentage of the capacity on the filesystem. This setting is not used if a Data Directories Free Space Monitoring Absolute setting is configured. critical:never, warning:never NameNode Directory Failures Details: This is a NameNode health check that checks for whether the NameNode has reported any failed data directories. A failure of this health check indicates that there is a problem with one or more data directories on the NameNode. See the NameNode system web UI for more information. Note that unless the configuration dfs.namenode.name.dir.restore is configured to, the NameNode will require a restart to recognize data directories that have been restored (e.g., after an NFS outage). This test can be configured using the NameNode Directory Failures NameNode monitoring setting. Short Name: Name Directory Status NameNode Directory Failures The health check of failed status directories in a NameNode. namenode_directory _failures_ critical:any, warning:never NameNode File Descriptor Details: This NameNode health check checks that the number of file descriptors used does not rise above some percentage of the NameNode file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be Cloudera Manager 4.6 Health Checks 87

98 NameNode Garbage Collection Duration configured using the File Descriptor Monitoring NameNode monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. namenode_fd_ critical: , warning: NameNode Garbage Collection Duration Details: This NameNode health check checks that the NameNode is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or misconfiguration of the NameNode. This test can be configured using the NameNode Garbage Collection Duration and NameNode Garbage Collection Duration Monitoring Period NameNode monitoring settings. Short Name: GC Duration NameNode Garbage Collection Duration Monitoring Period The period to review when computing the moving average of garbage collection time. namenode_gc_ duration_window 5 MINUTES NameNode Garbage Collection Duration The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See NameNode Garbage Collection Duration Monitoring Period. namenode_gc_ critical: , duration_ warning: Cloudera Manager 4.6 Health Checks

99 NameNode High Availability Checkpoint Age NameNode High Availability Checkpoint Age Details: This is a NameNode health check that checks the NameNode's filesystem checkpoint is no older than some percentage of the Filesystem Checkpoint Period and that the number of transactions that have occurred since the last filesystem checkpoint does not exceed some percentage of the Filesystem Checkpoint Transaction Limit. A failure of this health check may indicate a problem with the active NameNode or its configured checkpointing role, i.e. either a standby NameNode or a SecondaryNameNode. Check the NameNode logs or the logs of the checkpointing role for more details. This test can be configured using the Filesystem Checkpoint Age Monitoring and Filesystem Checkpoint Transactions Monitoring NameNode monitoring setting. Short Name: Checkpoint Status Filesystem Checkpoint Age Monitoring The health check namenode_checkpoint of the age of _age_ the HDFS namespace checkpoint. Specified as a percentage of the configured checkpoint interval. critical: , warning: Filesystem Checkpoint Transactions Monitoring The health check of the number of transactions since the last HDFS namespace checkpoint. Specified as a percentage of the configured checkpointing transaction limit. namenode_checkpoint _transactions_ critical: , warning: NameNode Host Health Details: This NameNode health check factors in the health of the host upon which the NameNode is running. A failure of this check means that the host running the NameNode is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the NameNode Host Health Check NameNode monitoring setting. Short Name: Host Health Cloudera Manager 4.6 Health Checks 89

100 NameNode JournalNode Sync Status NameNode Host Health Check When computing the overall NameNode health, consider the host's health. namenode_host_ health_enabled NameNode JournalNode Sync Status Details: This is a NameNode health check that checks that the NameNode is not out-of-sync with too many JournalNodes. This check is disabled if Quorum-based storage is not in use.a failure of this health check may indicate a problem with the JournalNodes or a communication problem between the NameNode and the JournalNodes. Check the NameNode and the JournalNode logs for more details.this test can be configured using the NameNode Out-Of-Sync JournalNodes NameNode monitoring setting. Short Name: JournalNode Sync Status NameNode Out-Of- Sync JournalNodes The health check namenode_out_of for the number of out-of-sync _sync_journal_ JournalNodes for this nodes_ NameNode. critical:any, warning:never NameNode Log Directory Free Space Details: This NameNode health check checks that the filesystem containing the log directory of this NameNode has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage NameNode monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES 90 Cloudera Manager 4.6 Health Checks

101 NameNode RPC Latency Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never NameNode RPC Latency Details: This is a NameNode health check that checks that the average time it takes for the NameNode to respond to requests does not exceed some value. A failure of this health check could indicate misconfiguration of the NameNode, that the NameNode is having a problem writing to one of its data directories, or may indicate a capacity planning problem. Check the NameNode's RpcQueueTime_avg_time and if this indicates that the bulk of the RPC latency is spent with requests queued, try increasing the NameNode Handler Count. If the NameNode's RpcProcessingTime_avg_time indicates the bulk of the RPC latency is due to request processing, check to see that each of the directories in which the HDFS metadata is being stored is performing adequately. This test can be configured using the NameNode RPC Latency NameNode monitoring setting. Short Name: RPC Latency NameNode RPC Latency The health check of the NameNode's RPC latency namenode_rpc_ latency_ critical: , warning: MILLISECONDS NameNode Safe Mode Details: This is a NameNode health check that checks that the NameNode is not in safemode. A failure of this health check indicates that the NameNode is in safemode. Look in the NameNode logs for more details. This test can be enabled or disabled using the NameNode Safemode Health Check NameNode Cloudera Manager 4.6 Health Checks 91

102 NameNode Cloudera Manager Agent Health monitoring setting. The Health Check Startup Tolerance NameNode monitoring setting also controls the amount of time at NameNode start up to tolerate safemode. Short Name: Safe Mode Status Health Check Startup Tolerance The amount of time allowed after this role is started that failures of health checks that rely on communication with this role will be tolerated. namenode_startup_ tolerance 5 MINUTES NameNode Safemode Health Check Enables the health check that the NameNode is not in safemode namenode_safe_ mode_enabled NameNode Cloudera Manager Agent Health Details: This NameNode health check checks that the Cloudera Manager Agent on the NameNode host is heart beating correctly and that the process associated with the NameNode role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the NameNode process, a lack of connectivity to the Cloudera Manager Agent on the NameNode host, or a problem with the Cloudera Manager Agent. This check can fail either because the NameNode has crashed or because the NameNode will not start or stop in a timely fashion. Check the NameNode logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the NameNode host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the NameNode host, or look in the Cloudera Manager Agent logs on the NameNode host for more details. This test can be enabled or disabled using the NameNode Process Health Check NameNode monitoring setting. Short Name: Process Status NameNode Process Health Check Enables the health check that the NameNode's process state is consistent with the role configuration namenode_scm_ health_enabled 92 Cloudera Manager 4.6 Health Checks

103 NameNode Unexpected Exits NameNode Unexpected Exits Details: This NameNode health check checks that the NameNode has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period NameNode monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when unexpected_exits_ computing unexpected exits. window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never NameNode Upgrade Status Details: This is a NameNode health check that checks for the presence of an unfinalized HDFS metadata upgrade. If an unfinalized HDFS metadata upgrade is detected, this check returns "Concerning" health. A failure of this health check indicates that a previously performed HDFS upgrade needs to be finalized. This can be done via the Finalize Metadata Upgrade NameNode command using Cloudera Manager. This test can be enabled or disabled using the HDFS Upgrade Status Health Check NameNode monitoring setting. Short Name: Upgrade Status HDFS Upgrade Enables the health check of Status Health Check the upgrade status of the NameNode. namenode_upgrade _status_enabled Cloudera Manager 4.6 Health Checks 93

104 NameNode Web Metric Collection NameNode Web Metric Collection Details: This NameNode health check checks that the web server of the NameNode is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the NameNode, a misconfiguration of the NameNode or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the NameNode for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the NameNode's web server are failing or timing out. These requests are completely local to the NameNode's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the NameNode's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection NameNode monitoring setting. Short Name: Web Server Status Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. namenode_web_ metric_collection_ enabled Navigator File Descriptor Details: This Navigator Server health check checks that the number of file descriptors used does not rise above some percentage of the Navigator Server file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Navigator Server monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. navigator_fd_ critical: , warning: Cloudera Manager 4.6 Health Checks

105 Navigator Host Health Navigator Host Health Details: This Navigator Server health check factors in the health of the host upon which the Navigator Server is running. A failure of this check means that the host running the Navigator Server is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Navigator Host Health Check Navigator Server monitoring setting. Short Name: Host Health Navigator Host Health Check When computing the overall Navigator health, consider the host's health. navigator_host_ health_enabled Navigator Log Directory Free Space Details: This Navigator Server health check checks that the filesystem containing the log directory of this Navigator Server has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Navigator Server monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: BYTES , warning: Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is log_directory_free_ space_percentage_ critical:never, warning:never Cloudera Manager 4.6 Health Checks 95

106 Navigator Cloudera Manager Agent Health configured. Navigator Cloudera Manager Agent Health Details: This Navigator Server health check checks that the Cloudera Manager Agent on the Navigator Server host is heart beating correctly and that the process associated with the Navigator Server role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Navigator Server process, a lack of connectivity to the Cloudera Manager Agent on the Navigator Server host, or a problem with the Cloudera Manager Agent. This check can fail either because the Navigator Server has crashed or because the Navigator Server will not start or stop in a timely fashion. Check the Navigator Server logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Navigator Server host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Navigator Server host, or look in the Cloudera Manager Agent logs on the Navigator Server host for more details. This test can be enabled or disabled using the Navigator Process Health Check Navigator Server monitoring setting. Short Name: Process Status Navigator Process Health Check Enables the health check that the Navigator's process state is consistent with the role configuration navigator_scm_ health_enabled Navigator Unexpected Exits Details: This Navigator Server health check checks that the Navigator Server has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Navigator Server monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits _window 5 MINUTES 96 Cloudera Manager 4.6 Health Checks

107 RegionServer Compaction Queue Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits _ critical:any, warning:never RegionServer Compaction Queue Details: This is a RegionServer health check that checks that a moving average of the size of the RegionServer's compaction queue does not exceed some value. A failure of this health check indicates a high write load on the RegionServer. Try reducing the write load on the RegionServer, increasing capacity by adding additional disks to the RegionServer, or adding additional RegionServers. This test can be configured using the HBase RegionServer Compaction Queue Monitoring and HBase RegionServer Compaction Queue Monitoring Period RegionServer monitoring settings. Short Name: Compaction Queue Size HBase RegionServer Compaction Queue Monitoring Period The period over which to compute the moving average of the compaction queue size. regionserver_ compaction_queue _window 5 MINUTES HBase RegionServer Compaction Queue Monitoring The health check of the weighted average size of the HBase RegionServer compaction queue over a recent period. See HBase RegionServer Compaction Queue Monitoring Period. regionserver_ compaction_queue _ critical:never, warning: RegionServer File Descriptor Details: This RegionServer health check checks that the number of file descriptors used does not rise above some percentage of the RegionServer file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be Cloudera Manager 4.6 Health Checks 97

108 RegionServer Flush Queue configured using the File Descriptor Monitoring RegionServer monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. regionserver_fd_ critical: , warning: RegionServer Flush Queue Details: This is a RegionServer health check that checks that a moving average of the size of the RegionServer's flush queue does not exceed some value. A failure of this health check indicates a high write load on the RegionServer. Try reducing the write load on the RegionServer, increasing capacity by adding additional disks to the RegionServer, or adding additional RegionServers. This test can be configured using the HBase RegionServer Flush Queue Monitoring and HBase RegionServer Flush Queue Monitoring Period RegionServer monitoring settings. Short Name: Flush Queue Size HBase RegionServer Flush Queue Monitoring Period The period over which to compute the moving average of the flush queue size. regionserver_flush_ queue_window 5 MINUTES HBase RegionServer Flush Queue Monitoring The health check of the average size of the HBase RegionServer flush queue over a recent period. See HBase RegionServer Flush Queue Monitoring Period. regionserver_flush_ queue_ critical:never, warning: RegionServer Garbage Collection Duration Details: This RegionServer health check checks that the RegionServer is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent 98 Cloudera Manager 4.6 Health Checks

109 RegionServer Host Health performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or misconfiguration of the RegionServer. This test can be configured using the HBase RegionServer Garbage Collection Duration and HBase RegionServer Garbage Collection Duration Monitoring Period RegionServer monitoring settings. Short Name: GC Duration HBase Region Server Garbage Collection Duration Monitoring Period The period to review when computing the moving average of garbage collection time. regionserver_gc_ duration_window 5 MINUTES HBase Region Server Garbage Collection Duration The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See HBase RegionServer Garbage Collection Duration Monitoring Period. regionserver_gc_ duration_ critical: , warning: RegionServer Host Health Details: This RegionServer health check factors in the health of the host upon which the RegionServer is running. A failure of this check means that the host running the RegionServer is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the HBase RegionServer Host Health Check RegionServer monitoring setting. Short Name: Host Health HBase Region Server Host Health Check When computing the overall HBase RegionServer health, consider the host's health. regionserver_host _health_enabled Cloudera Manager 4.6 Health Checks 99

110 RegionServer Log Directory Free Space RegionServer Log Directory Free Space Details: This RegionServer health check checks that the filesystem containing the log directory of this RegionServer has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage RegionServer monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never RegionServer Master Connectivity Details: This is a RegionServer health check that checks that the Master considers the RegionServer alive. A failure of this health check may indicate that the RegionServer is having trouble communicating with at least the HBase Master and possibly the entire HBase cluster. Look in the RegionServer logs for more details. This test can be enabled or disabled using the HBase RegionServer to Master Connectivity Check RegionServer monitoring setting. The HBase RegionServer Connectivity Tolerance at Startup RegionServer monitoring setting and the Health Check Startup Tolerance Master monitoring setting can be used to control the check's tolerance windows around RegionServer and Master restarts respectively. Short Name: Cluster Connectivity 100 Cloudera Manager 4.6 Health Checks

111 RegionServer Memstore Size HBase Region Server Connectivity Tolerance at Startup The amount of time to wait for the HBase Region Server to fully start up and connect to the HBase Master before enforcing the connectivity check. regionserver_ connectivity_ tolerance 180 SECONDS HBase RegionServer to Master Connectivity Check Enables the health check that the RegionServer is connected to the Master regionserver_ master_ connectivity_ enabled Health Check Startup Tolerance The amount of time allowed master_startup_ after this role is started that tolerance failures of health checks that rely on communication with this role will be tolerated. 5 MINUTES RegionServer Memstore Size Details: This is a RegionServer health check that checks that the amount of the RegionServer's memory devoted to memstores does not exceed some percentage of the RegionServer's configured hbase.regionserver.global.memstore.upperlimit. When a RegionServer's memstores reach this maximum size, new updates are blocked while the RegionServer flushes. A failure of this health check indicates a high write load on the RegionServer. Try reducing the write load on the RegionServer, increasing capacity by adding additional disks to the RegionServer, or adding additional RegionServers. This test can be configured using the HBase RegionServer Memstore Size RegionServer monitoring setting. Short Name: Memstore Size HBase RegionServer Memstore Size The health check of the total size of RegionServer's memstores. Specified as a percentage of the configured upper limit. See Maximum Size of All regionserver_ memstore_size_ critical: , warning: Cloudera Manager 4.6 Health Checks 101

112 RegionServer Read Latency Memstores in RegionServer. RegionServer Read Latency Details: This is a RegionServer health check that checks that the average time it takes for the RegionServer to read from HDFS does not exceed some value. The health check computes a moving average of the average HDFS read time for the RegionServer over a configurable window and then compares that moving average to a configured threshold value. A moving average is used to avoid alerting on short-lived HDFS read latency spikes. The behavior of this health check depends on how the HBase cluster is being used. A failure of this health check may indicate a problem with HDFS, extensive load on the cluster, or a capacity management problem. For example, it is possible to see average read latency increase while executing MapReduce jobs on the cluster. In such situations, if the read latency increase is unacceptable, consider running the MapReduce jobs with fewer map or reduce slots. This test can be configured using the HBase RegionServer HDFS Read Latency and HBase RegionServer HDFS Read Latency Monitoring Period RegionServer monitoring settings. Short Name: HDFS Read Latency HBase RegionServer HDFS Read Latency Monitoring Period The period over which to compute the moving average of the HDFS read latency of the HBase RegionServer. regionserver_read_ latency_window 5 MINUTES HBase RegionServer HDFS Read Latency The health check regionserver_read_ of the latency latency_ that the RegionServer sees for HDFS read operations critical: , warning: MILLISECONDS RegionServer Cloudera Manager Agent Health Details: This RegionServer health check checks that the Cloudera Manager Agent on the RegionServer host is heart beating correctly and that the process associated with the RegionServer role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the RegionServer process, a lack of connectivity to the Cloudera Manager Agent on the RegionServer host, or a problem with the Cloudera Manager Agent. This check can fail either because the RegionServer has 102 Cloudera Manager 4.6 Health Checks

113 RegionServer Store File Idx Size crashed or because the RegionServer will not start or stop in a timely fashion. Check the RegionServer logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the RegionServer host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the RegionServer host, or look in the Cloudera Manager Agent logs on the RegionServer host for more details. This test can be enabled or disabled using the HBase RegionServer Process Health Check RegionServer monitoring setting. Short Name: Process Status HBase Region Server Process Health Check Enables the health check that the HBase RegionServer's process state is consistent with the role configuration regionserver_scm_ health_enabled RegionServer Store File Idx Size Details: This is a RegionServer health check that checks that the sum of the sizes of all store file indexes does not exceed some percentage of the RegionServer's maximum heap size. A failure of this health check indicates that the RegionServer is using a significant portion of its memory for store file indexes. If the amount of memory devoted to these indexes is undesirably high, the size of indexes can be reduced by increasing the HBase block size, by using smaller key values, or by using fewer columns. Each of these choices involves trade-offs. Contact Cloudera Support for more information on this topic. This test can be configured using the Percentage of Heap Used by HStoreFile Index RegionServer monitoring setting. Short Name: Store File Index Size Percentage of Heap Used by HStoreFile Index The health check of the size used by the HStoreFile index. Specified as a percentage of the total heap size. regionserver_store _file_idx_size_ critical:never, warning: RegionServer Sync Latency Details: This is a RegionServer health check that checks that the average time it takes for the RegionServer to perform HDFS syncs does not exceed some value. The health check computes a moving average of the average HDFS sync time for the RegionServer over a configurable window and then Cloudera Manager 4.6 Health Checks 103

114 RegionServer Unexpected Exits compares that moving average to a configured threshold value. A moving average is used to avoid alerting on short-lived HDFS sync latency spikes. The behavior of this health check depends on how the HBase cluster is being used. A failure of this health check may indicate a problem with HDFS, extensive load on the cluster, or a capacity management problem. For example, it is possible to see average sync latency increase while executing MapReduce jobs on the cluster. In such situations, if the sync latency increase is unacceptable, consider running the MapReduce jobs with fewer map or reduce slots. This test can be configured using the HBase RegionServer HDFS Sync Latency and HBase RegionServer HDFS Sync Latency Monitoring Period RegionServer monitoring settings. Short Name: HDFS Sync Latency HBase RegionServer HDFS Sync Latency Monitoring Period The period over which to compute the moving average of the HDFS sync latency of the HBase RegionServer. regionserver_sync_ latency_window 5 MINUTES HBase RegionServer HDFS Sync Latency The health check for the latency of HDFS write operations that the RegionServer detects regionserver_sync_ latency_ critical: , warning: MILLISECONDS RegionServer Unexpected Exits Details: This RegionServer health check checks that the RegionServer has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period RegionServer monitoring settings. Short Name: Unexpected Exits 104 Cloudera Manager 4.6 Health Checks

115 RegionServer Web Metric Collection Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never RegionServer Web Metric Collection Details: This RegionServer health check checks that the web server of the RegionServer is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the RegionServer, a misconfiguration of the RegionServer or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the RegionServer for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the RegionServer's web server are failing or timing out. These requests are completely local to the RegionServer's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the RegionServer's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection RegionServer monitoring setting. Short Name: Web Server Status Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. regionserver_web_ metric_collection_ enabled Reports Manager File Descriptor Details: This Reports Manager health check checks that the number of file descriptors used does not rise above some percentage of the Reports Manager file descriptor limit. A failure of this health check may Cloudera Manager 4.6 Health Checks 105

116 Reports Manager Host Health indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Reports Manager monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. reportsmanager_fd _ critical: , warning: Reports Manager Host Health Details: This Reports Manager health check factors in the health of the host upon which the Reports Manager is running. A failure of this check means that the host running the Reports Manager is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Reports Manager Host Health Check Reports Manager monitoring setting. Short Name: Host Health Reports Manager Host Health Check When computing the overall Reports Manager health, consider the host's health. reportsmanager_host _health_enabled Reports Manager Log Directory Free Space Details: This Reports Manager health check checks that the filesystem containing the log directory of this Reports Manager has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Reports Manager monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the log_directory_free_ space_absolute_ critical: , warning: BYTES 106 Cloudera Manager 4.6 Health Checks

117 Reports Manager Cloudera Manager Agent Health filesystem that contains this role's log directory Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never Reports Manager Cloudera Manager Agent Health Details: This Reports Manager health check checks that the Cloudera Manager Agent on the Reports Manager host is heart beating correctly and that the process associated with the Reports Manager role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Reports Manager process, a lack of connectivity to the Cloudera Manager Agent on the Reports Manager host, or a problem with the Cloudera Manager Agent. This check can fail either because the Reports Manager has crashed or because the Reports Manager will not start or stop in a timely fashion. Check the Reports Manager logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Reports Manager host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Reports Manager host, or look in the Cloudera Manager Agent logs on the Reports Manager host for more details. This test can be enabled or disabled using the Reports Manager Process Health Check Reports Manager monitoring setting. Short Name: Process Status Reports Manager Process Health Check Enables the health check that the Reports Manager's process state is consistent with the role reportsmanager_scm _health_enabled Cloudera Manager 4.6 Health Checks 107

118 Reports Manager Scratch Directory Free Space configuration Reports Manager Scratch Directory Free Space Details: This is a Reports Manager health check that checks that the filesystem containing the scratch directory of this Reports Manager has sufficient free space. This test can be configured using the Scratch Directory Free Space Monitoring Absolute and Scratch Directory Free Space Monitoring Percentage Reports Manager monitoring settings. Short Name: Scratch Directory Free Space Scratch Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains the scratch directory. reportsmanager_ scratch_directory_ free_space_ absolute_ critical: BYTES , warning: Scratch Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains the scratch directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Scratch Directory Free Space Monitoring Absolute setting is configured. reportsmanager_ scratch_directory_ free_space_ percentage_ critical:never, warning:never Reports Manager Unexpected Exits Details: This Reports Manager health check checks that the Reports Manager has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected 108 Cloudera Manager 4.6 Health Checks

119 Secondary NameNode Checkpoint Directories Free Space Exits and Unexpected Exits Monitoring Period Reports Manager monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when unexpected_exits_ computing unexpected exits. window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never Secondary NameNode Checkpoint Directories Free Space Details: This is a Secondary NameNode health check that checks that the filesystems containing the checkpoint directories have sufficient free space. This test can be configured using the Checkpoint Directories Free Space Monitoring Percentage and Secondary NameNode monitoring settings. Short Name: Checkpoint Directories Free Space Checkpoint Directories Free Space Monitoring Absolute The health check for monitoring of free space on the filesystems that contain this role's checkpoint directories. secondarynamenode_ checkpoint_directories_free_ space_absolute_ critical: , warning: BYTES Checkpoint Directories Free Space Monitoring Percentage The health check for monitoring of free space on the filesystems that contain this role's checkpoint secondarynamenode_ critical:never, checkpoint_directories_free_ warning:never space_percentage_ Cloudera Manager 4.6 Health Checks 109

120 Secondary NameNode File Descriptor directories. Specified as a percentage of the capacity on the filesystem. This setting is not used if a Checkpoint Directories Free Space Monitoring Absolute setting is configured. Secondary NameNode File Descriptor Details: This SecondaryNameNode health check checks that the number of file descriptors used does not rise above some percentage of the SecondaryNameNode file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring SecondaryNameNode monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. secondarynamenode _fd_ critical: , warning: Secondary NameNode Garbage Collection Duration Details: This SecondaryNameNode health check checks that the SecondaryNameNode is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or misconfiguration of the SecondaryNameNode. This test can be configured using the Secondary NameNode Garbage Collection Duration and Secondary NameNode 110 Cloudera Manager 4.6 Health Checks

121 Secondary NameNode Host Health Garbage Collection Duration Monitoring Period SecondaryNameNode monitoring settings. Short Name: GC Duration Secondary NameNode Garbage Collection Duration Monitoring Period The period to review when computing the moving average of garbage collection time. secondarynamenode _gc_duration_ window 5 MINUTES Secondary NameNode Garbage Collection Duration The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See Secondary NameNode Garbage Collection Duration Monitoring Period. secondarynamenode _gc_duration_ critical: , warning: Secondary NameNode Host Health Details: This SecondaryNameNode health check factors in the health of the host upon which the SecondaryNameNode is running. A failure of this check means that the host running the SecondaryNameNode is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Secondary NameNode Host Health Check SecondaryNameNode monitoring setting. Short Name: Host Health Secondary NameNode Host Health Check When computing the secondarynamenode overall Secondary _host_health_ NameNode health, consider enabled the host's health. Cloudera Manager 4.6 Health Checks 111

122 Secondary NameNode Log Directory Free Space Secondary NameNode Log Directory Free Space Details: This SecondaryNameNode health check checks that the filesystem containing the log directory of this SecondaryNameNode has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage SecondaryNameNode monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check log_directory_free_ for monitoring of free space space_percentage_ on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. critical:never, warning:never Secondary NameNode Cloudera Manager Agent Health Details: This SecondaryNameNode health check checks that the Cloudera Manager Agent on the SecondaryNameNode host is heart beating correctly and that the process associated with the SecondaryNameNode role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the SecondaryNameNode process, a lack of connectivity to the Cloudera Manager Agent on the SecondaryNameNode host, or a problem with the Cloudera Manager Agent. This check can fail either because the SecondaryNameNode has crashed or because the SecondaryNameNode will not start or stop in a timely fashion. Check the SecondaryNameNode logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the SecondaryNameNode host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the SecondaryNameNode host, or look in the Cloudera Manager Agent logs on the SecondaryNameNode host for more details. This test can be enabled or 112 Cloudera Manager 4.6 Health Checks

123 Secondary NameNode Unexpected Exits disabled using the Secondary NameNode Process Health Check SecondaryNameNode monitoring setting. Short Name: Process Status Secondary NameNode Process Health Check Enables the health check that the Secondary NameNode's process state is consistent with the role configuration secondarynamenode _scm_health_ enabled Secondary NameNode Unexpected Exits Details: This SecondaryNameNode health check checks that the SecondaryNameNode has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period SecondaryNameNode monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never Secondary NameNode Web Metric Collection Details: This SecondaryNameNode health check checks that the web server of the SecondaryNameNode is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the Cloudera Manager 4.6 Health Checks 113

124 ZooKeeper Server Connection Count web server of the SecondaryNameNode, a misconfiguration of the SecondaryNameNode or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the SecondaryNameNode for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the SecondaryNameNode's web server are failing or timing out. These requests are completely local to the SecondaryNameNode's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the SecondaryNameNode's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection SecondaryNameNode monitoring setting. Short Name: Web Server Status Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. secondarynamenode _web_metric_ collection_enabled ZooKeeper Server Connection Count Details: This is a ZooKeeper Server role-level health check that checks that a moving average of the ZooKeeper Server's connection count does not exceed some value. A failure of this health check indicates a high connection load on the ZooKeeper Server. This test can be configured using the ZooKeeper Server Connection Count and ZooKeeper Server Connection Count Monitoring Period ZooKeeper Server monitoring settings. Short Name: Connection Count. ZooKeeper Server Connection Count Monitoring Period The period to review when computing the moving average of the connection count. Specified in minutes. zookeeper_server_ connection_count_ window 3 MINUTES ZooKeeper Server Connection Count The health check of the weighted average size of the ZooKeeper Server connection count over a recent period. See ZooKeeper Server zookeeper_server_ connection_count_ critical:never, warning:never 114 Cloudera Manager 4.6 Health Checks

125 ZooKeeper Server Data Directory Free Space Connection Count Monitoring Period. ZooKeeper Server Data Directory Free Space Details: This is a ZooKeeper Server health check that checks that the filesystem containing the data directory of this ZooKeeper Server has sufficient free space. The data directory contains the database snapshots of the ZooKeeper Server. This test can be configured using the Data Directory Free Space Monitoring Absolute and Data Directory Free Space Monitoring Percentage ZooKeeper Server monitoring settings. Short Name: Data Directory Free Space Data Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains the ZooKeeper Server's data directory. zookeeper_server_ data_directory_free_ space_absolute_ critical: , warning: BYTES Data Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains the ZooKeeper server's data directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Data Directory Free Space Monitoring Absolute setting is configured. zookeeper_server_ critical:never, data_directory_free_ warning:never space_percentage_ Cloudera Manager 4.6 Health Checks 115

126 ZooKeeper Server Data Log Directory Free Space ZooKeeper Server Data Log Directory Free Space Details: This is a ZooKeeper Server health check that checks that the filesystem containing the data log directory of this ZooKeeper Server has sufficient free space. The data log directory contains the transaction logs of the ZooKeeper Server. This test can be configured using the Data Log Directory Free Space Monitoring Absolute and Data Log Directory Free Space Monitoring Percentage ZooKeeper Server monitoring settings. Short Name: Data Log Directory Free Space Data Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains the ZooKeeper server's data log directory. zookeeper_server_ data_log_directory_ free_space_absolute _ critical: , warning: BYTES Data Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains the ZooKeeper server's data log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Data Log Directory Free Space Monitoring Absolute setting is configured. zookeeper_server_ data_log_directory_ free_space_ percentage_ critical:never, warning:never ZooKeeper Server File Descriptor Details: This Server health check checks that the number of file descriptors used does not rise above some percentage of the Server file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Server monitoring setting. Short Name: File Descriptors 116 Cloudera Manager 4.6 Health Checks

127 ZooKeeper Server Garbage Collection Duration File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. zookeeper_server_ fd_ critical: , warning: ZooKeeper Server Garbage Collection Duration Details: This Server health check checks that the Server is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or misconfiguration of the Server. This test can be configured using the ZooKeeper Server Garbage Collection Duration and ZooKeeper Server Garbage Collection Duration Monitoring Period Server monitoring settings. Short Name: GC Duration ZooKeeper Server Garbage Collection Duration Monitoring Period The period to review when computing the moving average of garbage collection time. zookeeper_server_gc _duration_window 5 MINUTES ZooKeeper Server Garbage Collection Duration The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See ZooKeeper Server Garbage Collection Duration Monitoring Period. zookeeper_server_gc _duration_threshold s critical: , warning: ZooKeeper Server Host Health Details: This Server health check factors in the health of the host upon which the Server is running. A failure of this check means that the host running the Server is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the ZooKeeper Server Host Cloudera Manager 4.6 Health Checks 117

128 ZooKeeper Server Log Directory Free Space Health Check Server monitoring setting. Short Name: Host Health ZooKeeper Server Host Health Check When computing the overall ZooKeeper Server health, consider the host's health. zookeeper_server_ host_health_ enabled ZooKeeper Server Log Directory Free Space Details: This Server health check checks that the filesystem containing the log directory of this Server has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Server monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never 118 Cloudera Manager 4.6 Health Checks

129 ZooKeeper Server Max Latency ZooKeeper Server Max Latency Details: This is a ZooKeeper server health check that checks that the ratio of the maximum request latency to the maximum client-negotiable session timeout does not exceed some value. Note that the maximum client-negotiable timeout need not be the actual session timeout used by a client but is the upper bound of such client session timeouts. As a result, this check being in "Good" health does not preclude clients from experiencing timeouts based on the particular session timeout value they have negotiated with the server. A failure of this health check likely indicates a high load on the ZooKeeper Server. This test can be configured using the Maximum Latency Monitoring ZooKeeper Server monitoring setting. Short Name: Maximum Request Latency Maximum Latency Monitoring The percentage of the ratio of the maximum request latency to the maximum client-negotiable session timeout since the server was started. zookeeper_server_ max_latency_ critical: , warning: ZooKeeper Server Outstanding Requests Details: This is a ZooKeeper Server role-level health check that checks that a moving average of the size of the ZooKeeper Server's outstanding requests does not exceed some value. Outstanding requests are the number of queued requests in the server. It increases when the server is under load and is receiving more sustained requests than it can process. A failure of this health check indicates a high connection load on the ZooKeeper Server. This test can be configured using the ZooKeeper Server Outstanding Requests and ZooKeeper Server Outstanding Requests Monitoring Period ZooKeeper Server monitoring settings. Short Name: Outstanding Requests. ZooKeeper Server Outstanding Requests Monitoring Period The period to review when computing the moving average of the outstanding requests queue size. Specified in minutes. zookeeper_server_ outstanding_ requests_window 3 MINUTES ZooKeeper Server The health check zookeeper_server_ critical:never, Cloudera Manager 4.6 Health Checks 119

130 ZooKeeper Server Quorum Membership Outstanding Requests of the weighted average size of the ZooKeeper Server outstanding requests queue over a recent period. See ZooKeeper Server Outstanding Requests Monitoring Period. outstanding_ requests_ warning:never ZooKeeper Server Quorum Membership Details: This is a ZooKeeper server role-level health check to verify that the server is part of a quorum. This check is disabled if the ZooKeeper Server is in standalone mode. The check returns "Concerning" health as long as the quorum membership status for the ZooKeeper Server was determined within the detection window and it is in leader election. The check returns "Bad" health if the ZooKeeper Server is not part of a quorum or the quorum status of the ZooKeeper Server could not be determined for the entire detection window. A failure of this health check may indicate a communication problem between this ZooKeeper Server and the rest of its peers or between the Cloudera Manager Service Monitor and the ZooKeeper Server. Check the ZooKeeper Server and Cloudera Manager Service Monitor logs for additional information. This test can be enabled or disabled using the Enable the Quorum Membership Check ZooKeeper Server monitoring setting. In addition, the Quorum Membership Detection Window setting can be used to adjust the time that the Cloudera Manager Service Monitor has to detect the ZooKeeper Server quorum membership status before this health check fails. Short Name: Quorum Membership. Enable the Quorum Membership Check Enables the quorum membership check for this ZooKeeper Server. zookeeper_server_ quorum_members hip_enabled Quorum Membership Detection Window The tolerance window that will be used in the detection of a ZooKeeper server's membership in a quorum. Specified in minutes. zookeeper_server_ quorum_members hip_detection_ window 3 MINUTES 120 Cloudera Manager 4.6 Health Checks

131 ZooKeeper Server Cloudera Manager Agent Health ZooKeeper Server Cloudera Manager Agent Health Details: This Server health check checks that the Cloudera Manager Agent on the Server host is heart beating correctly and that the process associated with the Server role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Server process, a lack of connectivity to the Cloudera Manager Agent on the Server host, or a problem with the Cloudera Manager Agent. This check can fail either because the Server has crashed or because the Server will not start or stop in a timely fashion. Check the Server logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Server host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Server host, or look in the Cloudera Manager Agent logs on the Server host for more details. This test can be enabled or disabled using the ZooKeeper Server Process Health Check Server monitoring setting. Short Name: Process Status ZooKeeper Server Process Health Check Enables the health check that the ZooKeeper Server's process state is consistent with the role configuration zookeeper_server_ scm_health_ enabled ZooKeeper Server Unexpected Exits Details: This Server health check checks that the Server has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Server monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window unexpected_exits_ critical:any, warning:never Cloudera Manager 4.6 Health Checks 121

132 Service Monitor File Descriptor configuration for the role. Service Monitor File Descriptor Details: This Service Monitor health check checks that the number of file descriptors used does not rise above some percentage of the Service Monitor file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Service Monitor monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. servicemonitor_fd _ critical: , warning: Service Monitor Host Health Details: This Service Monitor health check factors in the health of the host upon which the Service Monitor is running. A failure of this check means that the host running the Service Monitor is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the Service Monitor Host Health Check Service Monitor monitoring setting. Short Name: Host Health Service Monitor Host Health Check When computing the overall Service Monitor health, consider the host's health. servicemonitor_host _health_enabled 122 Cloudera Manager 4.6 Health Checks

133 Service Monitor Log Directory Free Space Service Monitor Log Directory Free Space Details: This Service Monitor health check checks that the filesystem containing the log directory of this Service Monitor has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Service Monitor monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never Service Monitor Role Pipeline Details: This Service Monitor health check checks that no messages are being dropped by the role stage of the Service Monitor pipeline. A failure of this health check indicates a problem with the Service Monitor. This may indicate a configuration problem or a bug in the Service Monitor. This test can be configured using the Service Monitor Role Pipeline Monitoring Time Period monitoring setting. Short Name: Role Pipeline Service Monitor Role Pipeline The health check for monitoring the Service servicemonitor_role _pipeline_threshold critical:any, warning:never Cloudera Manager 4.6 Health Checks 123

134 Service Monitor Cloudera Manager Agent Health Monitoring Monitor role pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period. s Service Monitor Role Pipeline Monitoring Time Period The time period over which the Service Monitor role pipeline will be monitored for dropped messages. servicemonitor_role _pipeline_window 5 MINUTES Service Monitor Cloudera Manager Agent Health Details: This Service Monitor health check checks that the Cloudera Manager Agent on the Service Monitor host is heart beating correctly and that the process associated with the Service Monitor role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Service Monitor process, a lack of connectivity to the Cloudera Manager Agent on the Service Monitor host, or a problem with the Cloudera Manager Agent. This check can fail either because the Service Monitor has crashed or because the Service Monitor will not start or stop in a timely fashion. Check the Service Monitor logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Service Monitor host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Service Monitor host, or look in the Cloudera Manager Agent logs on the Service Monitor host for more details. This test can be enabled or disabled using the Service Monitor Process Health Check Service Monitor monitoring setting. Short Name: Process Status Service Monitor Process Health Check Enables the health check that the Service Monitor's process state is consistent with the role configuration servicemonitor_scm _health_enabled 124 Cloudera Manager 4.6 Health Checks

135 Service Monitor Unexpected Exits Service Monitor Unexpected Exits Details: This Service Monitor health check checks that the Service Monitor has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Service Monitor monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when computing unexpected exits. unexpected_exits_ window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never Service Monitor Web Metric Collection Details: This Service Monitor health check checks that the web server of the Service Monitor is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the Service Monitor, a misconfiguration of the Service Monitor or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the Service Monitor for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the Service Monitor's web server are failing or timing out. These requests are completely local to the Service Monitor's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the Service Monitor's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Service Monitor monitoring setting. Short Name: Web Server Status Cloudera Manager 4.6 Health Checks 125

136 StateStore File Descriptor Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. servicemonitor_web _metric_collection_ enabled StateStore File Descriptor Details: This Impala StateStore Daemon health check checks that the number of file descriptors used does not rise above some percentage of the Impala StateStore Daemon file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Impala StateStore Daemon monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. statestore_fd_ critical: , warning: StateStore Host Health Details: This Impala StateStore Daemon health check factors in the health of the host upon which the Impala StateStore Daemon is running. A failure of this check means that the host running the Impala StateStore Daemon is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the StateStore Host Health Check Impala StateStore Daemon monitoring setting. Short Name: Host Health StateStore Host Health Check When computing the overall StateStore health, consider statestore_host_ health_enabled 126 Cloudera Manager 4.6 Health Checks

137 StateStore Log Directory Free Space the host's health. StateStore Log Directory Free Space Details: This Impala StateStore Daemon health check checks that the filesystem containing the log directory of this Impala StateStore Daemon has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Impala StateStore Daemon monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never StateStore Memory Resident Set Size Health Details: This Impala StateStore Daemon health check checks that the size of the resident set does not rise above a configured threshold value. A failure of this health check may indicates that the Impala StateStore Daemon process is consuming more memory than expected. It is possible that that unexpected memory consumption may lead to swapping and decreased performance for processes running on the same host as this Impala StateStore Daemon. Increased Impala StateStore Daemon Cloudera Manager 4.6 Health Checks 127

138 StateStore Cloudera Manager Agent Health memory consumption may be caused by an increased workload on the Impala service, or by a bug in the Impala StateStore Daemon software. To avoid failures of this health check, free up additional memory for this Impala StateStore Daemon process and increase the Resident Set Size monitoring setting. This test can be configured using the Resident Set Size Impala StateStore Daemon monitoring setting. Short Name: Resident Set Size Resident Set Size The health check on the resident size of the process. process_resident_ set_size_ critical:never, warning:never BYTES StateStore Cloudera Manager Agent Health Details: This Impala StateStore Daemon health check checks that the Cloudera Manager Agent on the Impala StateStore Daemon host is heart beating correctly and that the process associated with the Impala StateStore Daemon role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Impala StateStore Daemon process, a lack of connectivity to the Cloudera Manager Agent on the Impala StateStore Daemon host, or a problem with the Cloudera Manager Agent. This check can fail either because the Impala StateStore Daemon has crashed or because the Impala StateStore Daemon will not start or stop in a timely fashion. Check the Impala StateStore Daemon logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Impala StateStore Daemon host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Impala StateStore Daemon host, or look in the Cloudera Manager Agent logs on the Impala StateStore Daemon host for more details. This test can be enabled or disabled using the StateStore Process Health Check Impala StateStore Daemon monitoring setting. Short Name: Process Status StateStore Process Health Check Enables the health check that the StateStore's process state is consistent with the role configuration statestore_scm_health _enabled 128 Cloudera Manager 4.6 Health Checks

139 StateStore Unexpected Exits StateStore Unexpected Exits Details: This Impala StateStore Daemon health check checks that the Impala StateStore Daemon has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Impala StateStore Daemon monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when unexpected_exits_ computing unexpected exits. window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never StateStore Web Metric Collection Details: This Impala StateStore Daemon health check checks that the web server of the Impala StateStore Daemon is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the Impala StateStore Daemon, a misconfiguration of the Impala StateStore Daemon or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the Impala StateStore Daemon for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the Impala StateStore Daemon's web server are failing or timing out. These requests are completely local to the Impala StateStore Daemon's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the Impala StateStore Daemon's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Impala StateStore Daemon monitoring setting. Short Name: Web Server Status Cloudera Manager 4.6 Health Checks 129

140 TaskTracker Blacklisted Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. statestore_web_ metric_collection_ enabled TaskTracker Blacklisted Details: This is a TaskTracker health check that checks that the JobTracker has not blacklisted the TaskTracker. A failure of this health check indicates that the JobTracker has blacklisted the TaskTracker because of the failure rate of tasks on the TaskTracker is significantly higher than the average cluster failure rate. Check the TaskTracker logs for more details. This test can be enabled or disabled using the TaskTracker Blacklisted Health Check TaskTracker monitoring setting. Short Name: Blacklisted Status TaskTracker Blacklisted Health Check Enables the health check that the TaskTracker is not blacklisted tasktracker_blacklisted _health_enabled TaskTracker Connectivity Details: This is a TaskTracker health check that checks that the JobTracker considers the TaskTracker alive. A failure of this health check may indicate that the TaskTracker is having trouble communicating with the JobTracker. Look in the TaskTracker logs for more details. This test can be enabled or disabled using the TaskTracker Connectivity Health Check TaskTracker monitoring setting. The TaskTracker Connectivity Tolerance at Startup TaskTracker monitoring setting and the Health Check Startup Tolerance JobTracker monitoring setting can be used to control the check's tolerance windows around TaskTracker and JobTracker restarts respectively. Short Name: JobTracker Connectivity Health Check Startup Tolerance The amount of time allowed after this role is started that failures of health checks that rely on jobtracker_startup_ tolerance 5 MINUTES 130 Cloudera Manager 4.6 Health Checks

141 TaskTracker File Descriptor communication with this role will be tolerated. TaskTracker Connectivity Health Check Enables the health check that the TaskTracker is connected to the JobTracker tasktracker_ connectivity_health _enabled TaskTracker Connectivity Tolerance at Startup The amount of time to wait for the TaskTracker to fully start up and connect to the JobTracker before enforcing the connectivity check. tasktracker_ connectivity_ tolerance 180 SECONDS TaskTracker File Descriptor Details: This TaskTracker health check checks that the number of file descriptors used does not rise above some percentage of the TaskTracker file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring TaskTracker monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. tasktracker_fd_ critical: , warning: TaskTracker Garbage Collection Duration Details: This TaskTracker health check checks that the TaskTracker is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or misconfiguration of the TaskTracker. This test can be configured using the TaskTracker Garbage Collection Duration and TaskTracker Garbage Collection Duration Monitoring Cloudera Manager 4.6 Health Checks 131

142 TaskTracker Host Health Period TaskTracker monitoring settings. Short Name: GC Duration TaskTracker Garbage Collection Duration Monitoring Period The period to review when computing the moving average of garbage collection time. tasktracker_gc_ duration_window 5 MINUTES TaskTracker Garbage Collection Duration The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See TaskTracker Garbage Collection Duration Monitoring Period. tasktracker_gc_ duration_ critical: , warning: TaskTracker Host Health Details: This TaskTracker health check factors in the health of the host upon which the TaskTracker is running. A failure of this check means that the host running the TaskTracker is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the TaskTracker Host Health Check TaskTracker monitoring setting. Short Name: Host Health TaskTracker Host Health Check When computing the overall TaskTracker health, consider the host's health. tasktracker_host_ health_enabled TaskTracker Log Directory Free Space Details: This TaskTracker health check checks that the filesystem containing the log directory of this TaskTracker has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage 132 Cloudera Manager 4.6 Health Checks

143 TaskTracker Cloudera Manager Agent Health TaskTracker monitoring settings. Short Name: Log Directory Free Space Log Directory Free Space Monitoring Absolute The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_ space_absolute_ critical: , warning: BYTES Log Directory Free Space Monitoring Percentage The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_ space_percentage_ critical:never, warning:never TaskTracker Cloudera Manager Agent Health Details: This TaskTracker health check checks that the Cloudera Manager Agent on the TaskTracker host is heart beating correctly and that the process associated with the TaskTracker role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the TaskTracker process, a lack of connectivity to the Cloudera Manager Agent on the TaskTracker host, or a problem with the Cloudera Manager Agent. This check can fail either because the TaskTracker has crashed or because the TaskTracker will not start or stop in a timely fashion. Check the TaskTracker logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the TaskTracker host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the TaskTracker host, or look in the Cloudera Manager Agent logs on the TaskTracker host for more details. This test can be enabled or disabled using the TaskTracker Process Health Check TaskTracker monitoring setting. Short Name: Process Status Cloudera Manager 4.6 Health Checks 133

144 TaskTracker Unexpected Exits TaskTracker Process Health Check Enables the health check that the TaskTracker's process state is consistent with the role configuration tasktracker_scm_ health_enabled TaskTracker Unexpected Exits Details: This TaskTracker health check checks that the TaskTracker has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there has been 1 or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period TaskTracker monitoring settings. Short Name: Unexpected Exits Unexpected Exits Monitoring Period The period to review when unexpected_exits_ computing unexpected exits. window 5 MINUTES Unexpected Exits The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never TaskTracker Web Metric Collection Details: This TaskTracker health check checks that the web server of the TaskTracker is responding quickly to requests by the Cloudera Manager agent, and that the Cloudera Manager agent can collect metrics from the web server. A failure of this health check may indicate a problem with the web server of the TaskTracker, a misconfiguration of the TaskTracker or a problem with the Cloudera Manager agent. Consult the Cloudera Manager agent logs and the logs of the TaskTracker for more detail. If the test's failure message indicates a communication problem, this means that the Cloudera Manager Agent's HTTP requests to the TaskTracker's web server are failing or timing out. These requests are completely local to the TaskTracker's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, then the TaskTracker's web server responded to the 134 Cloudera Manager 4.6 Health Checks

145 ZooKeeper Canary Health Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection TaskTracker monitoring setting. Short Name: Web Server Status Web Metric Collection Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. tasktracker_web_ metric_collection_ enabled ZooKeeper Canary Health Details: This is a ZooKeeper service-level health check that checks that basic client operations are working and are completing in a reasonable amount of time. This check reports the results of a periodic "canary" test that performs the following sequence of operations. First, it connects to and establishes a session (the root session) with the ZooKeeper service and creates a permanent znode to serve as the root of all canary operations. The canary test then connects to and establishes sessions (the child sessions) with each ZooKeeper server of the service. Each child session is used to create an ephemeral child znode under the canary root. After the child znodes have been created, watches that await znode deletion events are registered with each of the child znodes for each of the child sessions. The canary test then deletes each of the child znodes and then verifies that each child session has received deletion notifications for each of the child znodes. Finally the canary test closes all the child sessions, deletes the root znode and closes the root session. The check returns "Bad" health if the establishment of the root session to the ZooKeeper service fails, the creation of znodes (permanent or ephemeral) fails, the deletion of znodes fails or the retrieval of child znodes of the root znode fails. The check returns "Concerning" health when the canary test succeeds but has one or more servers that could not participate in the canary test operations or if the canary test runs too slowly. A failure of this health check may indicate that ZooKeeper is failing to satisfy client requests correctly or in a timely fashion. Check the status of the ZooKeeper servers, and look in the ZooKeeper server logs for more details. This test can be enabled or disabled using the ZooKeeper Canary Health Check ZooKeeper service monitoring setting. The ZooKeeper Canary Root Znode Path, ZooKeeper Canary Connection Timeout, ZooKeeper Canary Session Timeout, ZooKeeper Canary Operation Timeout settings control the operation of the canary. Short Name: ZooKeeper Canary Cloudera Manager 4.6 Health Checks 135

146 ZooKeeper Current Zxid ZooKeeper Canary Connection Timeout Configures the timeout used by the canary for connection establishment with ZooKeeper servers zookeeper_canary _connection_time out MILLISECONDS ZooKeeper Canary Health Check Enables the health check that a client can connect to ZooKeeper and perform basic operations zookeeper_canary _health_enabled ZooKeeper Canary Operation Timeout Configures the timeout used by the canary for ZooKeeper operations zookeeper_canary _operation_ timeout MILLISECONDS ZooKeeper Canary Root Znode Path Configures the path of the root znode under which all canary updates are performed zookeeper_canary _root_path /cloudera_man ager_zookeeper _canary ZooKeeper Canary Session Timeout Configures the timeout used by the canary sessions with ZooKeeper servers zookeeper_canary _session_timeout MILLISECONDS ZooKeeper Current Zxid Details: This ZooKeeper service-level health check monitors the current zxid to ensure that its xid component does not rollover. The zxid is a 64-bit number maintained by ZooKeeper and is made up of two parts. The higher order 32-bit part is the epoch and the lower order 32-bit part is the xid. This check concerns itself with the xid portion that has a maximum possible value of 0xffffffff. If the xid reaches this value a rollover can occur. The check returns "Concerning" or "Bad" health if the current xid is above a warning threshold or critical threshold respectively. The threshold is expressed as a percentage of the maximum possible xid. For example, if this check is configured with a warning percentage threshold of 80% and a critical percentage threshold of 95% for a ZooKeeper service, this check would return "Good" health if the current xid is less than 0xcccccccc. This check would return "Concerning" health if the current xid is between 0xcccccccc and 0xf If the current xid is above 0xf , this check would return "Bad" health. A failure of this health check indicates that an overflow of xid may occur in the near future if the corrective action of forcing a leader election is not taken. This test is disabled by default since rollover of the xid is a concern only in releases prior to CDH3u4. For those releases, the test 136 Cloudera Manager 4.6 Health Checks

147 ZooKeeper Servers Health needs to be enabled explicitly. This test can be configured using the ZooKeeper Current Zxid Monitoring Percentage ZooKeeper service-wide monitoring setting. Short Name: ZXID Rollover. ZooKeeper Current Zxid Monitoring Percentage The health check for monitoring of the xid portion of the current zxid of the service. Specified as a percentage of the maximum possible xid setting of 0xffffffff. zookeeper_current _zxid_percentage_ critical:never, warning:never ZooKeeper Servers Health Details: This is a ZooKeeper service-level health check that checks that enough of the ZooKeeper servers in the cluster are healthy. The check returns "Concerning" health if the number of healthy ZooKeeper servers falls below a warning threshold, expressed as a percentage of the total number of ZooKeeper servers. The check returns "Bad" health if the number of healthy and "Concerning" ZooKeeper servers falls below a critical threshold, expressed as a percentage of the total number of ZooKeeper servers. For example, if this check is configured with a warning threshold of 80% and a critical threshold of 60% for a cluster of 5 ZooKeeper servers, this check would return "Good" health if 4 or more ZooKeeper servers have good health. This check would return "Concerning" health if at least 3 ZooKeeper servers have either "Good" or "Concerning" health. If more than 2 ZooKeeper servers have bad health, this check would return "Bad" health. A failure of this health check indicates unhealthy ZooKeeper servers. Check the status of the individual ZooKeeper servers for more information. This test can be configured using the Healthy ZooKeeper Server Monitoring ZooKeeper service-wide monitoring setting. Short Name: ZooKeeper Servers Health Healthy ZooKeeper Server Monitoring The health check of the overall ZooKeeper service health. The check returns "Concerning" health if the percentage of "Healthy" ZooKeeper servers falls below the warning threshold. The check is unhealthy if the zookeeper_servers_ healthy_ critical: , warning: Cloudera Manager 4.6 Health Checks 137

148 ZooKeeper Servers Health total percentage of "Healthy" and "Concerning" ZooKeeper servers falls below the critical threshold. 138 Cloudera Manager 4.6 Health Checks