VR Health Checks | CloudStack Feature First Look

This feature introduces an easy and integrated way to check the health of virtual routers (VRs) within CloudStack. With the help of these checks, administrators can monitor VRs and take any necessary action when a failure is reported. These health checks can be basic or advanced.

Basic health checks include:

  • Connectivity from the management server to the virtual router
  • Connectivity from a virtual router to its interfaces’ gateways
  • Free disk space on the virtual router
  • CPU and memory usage
  • VR Sanity checks: SSH/dnsmasq/haproxy/httpd services running

Advanced health checks include:

  • DHCP / DNS configuration matches management server DB
  • iptables port forwarding rules match management server records
  • HAProxy configuration matches management server DB records
  • VR Version against the current version: this check is done by comparing the contents of the ‘/etc/cloudstack-release’ and ‘/var/cache/cloud/cloud-scripts-signature’ files with the data given by the management server
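The version check in the last item can be pictured with a short Python sketch. It is only an illustration and not the script shipped on the VR: the function names, the exact-match comparison after trimming whitespace, and the idea that the management server supplies two plain strings are all assumptions. Only the two file paths come from the text above.

```python
from pathlib import Path


def read_trimmed(path):
    """Return the file's contents without surrounding whitespace, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def version_check(expected_release, expected_signature,
                  release_file="/etc/cloudstack-release",
                  signature_file="/var/cache/cloud/cloud-scripts-signature"):
    """Compare both files with the expected values and return (passed, message).

    expected_release and expected_signature stand in for the data sent by the
    management server (hypothetical parameter names).
    """
    problems = []
    checks = (
        ("release", release_file, expected_release),
        ("scripts signature", signature_file, expected_signature),
    )
    for label, path, expected in checks:
        actual = read_trimmed(path)
        if actual is None:
            problems.append(f"{label}: cannot read {path}")
        elif actual != expected.strip():
            problems.append(f"{label}: expected {expected!r}, found {actual!r}")
    if problems:
        return False, "; ".join(problems)
    return True, "version matches"
```

Returning a message alongside the pass/fail flag mirrors the idea that each test reports an individual result an administrator can read.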

These health checks are run on each virtual router using the information that the management server periodically sends. After the virtual router completes the health checks, it stores the results in a dedicated JSON file for basic and advanced checks. The management server retrieves these results and stores them in the database.

The administrator can retrieve the health check results for a virtual router via the ‘getRouterHealthCheckResults’ API or through the UI, in the new Health Checks tab for each VR. This tab displays individual test results.
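As a rough illustration of calling the API, the Python sketch below builds a signed request URL using the standard CloudStack request-signing scheme (parameters sorted by name, lowercased, HMAC-SHA1 with the secret key). The endpoint, keys and router ID are placeholders, the helper name is made up for this example, and the ‘routerid’ parameter name is an assumption that should be checked against the API reference for your version.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def build_signed_url(endpoint, api_key, secret_key, params):
    """Return a signed CloudStack API URL for the given command parameters."""
    query = dict(params, apikey=api_key, response="json")
    # Sort parameters case-insensitively by name, as the signing scheme requires.
    pairs = sorted(((k, str(v)) for k, v in query.items()),
                   key=lambda kv: kv[0].lower())
    # The string to sign is the URL-encoded query, fully lowercased.
    to_sign = "&".join(f"{k}={quote(v, safe='')}" for k, v in pairs).lower()
    digest = hmac.new(secret_key.encode(), to_sign.encode(), hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    # The request itself keeps the original case of names and values.
    request_query = "&".join(f"{k}={quote(v, safe='')}" for k, v in pairs)
    return f"{endpoint}?{request_query}&signature={quote(signature, safe='')}"


if __name__ == "__main__":
    # Placeholder values; 'routerid' is assumed to be the parameter name.
    url = build_signed_url(
        "http://management-server.example:8080/client/api",
        "YOUR_API_KEY",
        "YOUR_SECRET_KEY",
        {"command": "getRouterHealthCheckResults", "routerid": "ROUTER_UUID"},
    )
    print(url)
```

The resulting URL can be fetched with any HTTP client; the sketch deliberately stops short of making the request so that it runs without a management server.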

The health check files are located in the ‘/root/healthchecks/’ directory on each virtual router. Administrators who want to add more health checks can place them in this directory by creating a new system VM template or updating the existing system VM ISO.

Administrators can control VR health checks using global settings, some of which can be overridden at the zone level. The settings are as follows:

VR health checks are enabled by default. However, the administrator can disable them using this global setting:

  • health.checks.enabled

The intervals for running advanced and basic health checks, refreshing the health check configuration, and fetching results are controlled using these global settings:

  • health.checks.advanced.interval
  • health.checks.basic.interval
  • health.checks.config.refresh.interval
  • health.checks.results.fetch.interval

A set of health checks can be ignored if the administrator sets them as a comma-separated list in this global setting:

  • health.checks.to.exclude
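To illustrate how a comma-separated exclusion list of this kind might be interpreted, here is a minimal Python sketch. The function names and the check names used with it are hypothetical, and how CloudStack itself parses the value is not described here.

```python
def parse_excluded(setting_value):
    """Split a comma-separated setting value into a set of check names.

    Empty entries and surrounding whitespace are ignored, so values such as
    "a, b,,c " are handled the same way as "a,b,c".
    """
    return {name.strip() for name in (setting_value or "").split(",") if name.strip()}


def checks_to_run(available, setting_value):
    """Return the available checks that are not excluded, preserving their order."""
    excluded = parse_excluded(setting_value)
    return [name for name in available if name not in excluded]
```

For example, with the hypothetical checks `["cpu_usage", "memory_usage", "disk_space"]` and the setting value `"cpu_usage, disk_space"`, only `memory_usage` would remain.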

If any of the health checks listed in this setting fails, the router is recreated:

  • health.checks.failures.to.recreate.vr

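Thresholds that determine when a test fails. The free disk space test fails when the measured value is below its threshold; the CPU and memory usage tests fail when usage exceeds the configured maximum: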

  • health.checks.free.disk.space.threshold
  • health.checks.max.cpu.usage.threshold
  • health.checks.max.memory.usage.threshold
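The Python sketch below shows one way to read these three thresholds. It assumes, from the setting names, that free disk space is a minimum while CPU and memory usage are maximums; the units (MB and percent), the dictionary keys and the function name are illustrative assumptions rather than CloudStack's own.

```python
def evaluate_thresholds(measured, thresholds):
    """Return a dict mapping each test name to True (pass) or False (fail).

    measured:   {"free_disk_mb": ..., "cpu_pct": ..., "memory_pct": ...}
    thresholds: {"min_free_disk_mb": ..., "max_cpu_pct": ..., "max_memory_pct": ...}
    """
    return {
        # Free disk space is a lower bound: fail when below the threshold.
        "free_disk_space": measured["free_disk_mb"] >= thresholds["min_free_disk_mb"],
        # CPU and memory usage are upper bounds: fail when above the threshold.
        "cpu_usage": measured["cpu_pct"] <= thresholds["max_cpu_pct"],
        "memory_usage": measured["memory_pct"] <= thresholds["max_memory_pct"],
    }
```

A value exactly equal to its threshold passes in this sketch; whether the real checks treat the boundary the same way is not stated here.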