Secure Live KVM VM Migration with CloudStack 4.11.1



CloudStack 4.11.1 introduces a new security enhancement, built on top of the new CA framework, to secure live KVM VM migrations. This feature allows live migration of guest VMs across KVM hosts using a secured, TLS-enabled libvirtd process. Without it, live migration of guest VMs across KVM hosts uses an unsecured TCP connection, which is prone to man-in-the-middle attacks and can leak critical VM data (the VM state and memory). This feature brings stability and security enhancements for CloudStack and KVM users.


The initial implementation of the CA framework was limited to the provisioning of X509 certificates to secure the KVM/CPVM/SSVM agent(s)  and the CloudStack management server(s). With the new enhancement, the X509 certificates are now also used by the libvirtd process on the KVM host to secure live VM migration to another secured KVM host.

The migration URI used between two secured KVM hosts is qemu+tls:// as opposed to the qemu+tcp:// used by an unsecured host. We’ve also enforced that live VM migration is allowed only between two secured KVM hosts or between two unsecured hosts, but not between KVM hosts with different security configurations. Between two secured KVM hosts, the web of trust is established by the common root CA certificate, which can validate the server certificate chain when live VM migration is initiated.
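The pairing rule above can be sketched in a few lines (an illustrative Python model, not CloudStack's actual Java code; the function names are ours):

```python
# Sketch of the migration rules described above: secured hosts use qemu+tls,
# unsecured hosts use qemu+tcp, and migration is only allowed between hosts
# with matching security configuration.

def migration_uri(destination_host: str, secured: bool) -> str:
    """Return the libvirt connection URI used for live migration."""
    scheme = "qemu+tls" if secured else "qemu+tcp"
    return f"{scheme}://{destination_host}/system"

def can_migrate(source_secured: bool, destination_secured: bool) -> bool:
    """Migration is only allowed when both hosts share the same security state."""
    return source_secured == destination_secured

print(migration_uri("kvm-host-2", secured=True))   # qemu+tls://kvm-host-2/system
print(can_migrate(True, False))                    # False
```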

As part of the process of securing a KVM host, the CA framework issues X509 certificates and provisions them to the host, and libvirtd is reconfigured to listen on the default TLS port 16514 and to use the same X509 certificates as the cloudstack-agent. In an existing environment, the admin will need to ensure that the default TLS port 16514 is not blocked; in a fresh environment, suitable iptables rules and other configuration are applied by cloudstack-setup-agent using a new '-s' flag.

Starting with CloudStack 4.11.1, hosts that don’t have both the cloudstack-agent and libvirtd processes secured and in the Up state will show up in the ‘Unsecure’ state in the UI (and in the host details as part of the listHosts API response):

This will allow admins to easily identify and secure hosts using a new ‘provision certificate’ button that can be used from the host’s details tab in the UI:

After a KVM host is successfully secured it will show up in the Up state:

As part of the onboarding and securing process, after all the KVM hosts have been secured the admin can also enforce strict authentication of client X509 certificates by the CA framework, by setting the global setting ‘ca.plugin.root.auth.strictness' to true (this does not require restarting the management server(s)).

About the author

Rohit Yadav is a Software Architect at ShapeBlue, the Cloud Specialists, and is a committer and PMC member of Apache CloudStack. Rohit spends most of his time designing and implementing features in Apache CloudStack.

Software based agent LB for CloudStack



Last year we implemented a new CA framework in CloudStack 4.11 to make communications between CloudStack management servers and their hypervisor agents more secure. As part of that work, we introduced the ability for CloudStack agents to connect to multiple management servers, avoiding the need for an external load balancer.

We’ve now extended the CA framework by implementing load balancing sorting algorithms, which are applied to the list of management servers before it is sent to the indirect agents. This allows the CloudStack management servers to balance the agent load between themselves, with no reliance on an external load balancer, and will be available in CloudStack 4.11.1. The new functionality also introduces the notion of a preferred management server for agents, and a background mechanism to check and eventually connect to the preferred management server (assumed to be the first on the list the agent receives).


The CloudStack administrator is responsible for setting the list of management servers to connect to, and an algorithm to sort that list, using global configurations on the CloudStack management server.

Management server perspective

This feature uses (and introduces) these configurations:

  • ‘’: The algorithm to be applied to the list of management servers in the ‘host’ configuration before it is sent to the agents. Allowed algorithm values are:
    • ‘static’: Each agent receives the same list as provided in the ‘host’ configuration, so no load balancing is performed.
    • ‘roundrobin’: The agents are evenly spread across management servers.
    • ‘shuffle’: Randomly sorts the list before it is sent to each agent.
  • ‘’: The interval in seconds after which an agent should check and try to connect to its preferred host.
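The three algorithms can be sketched as follows (an illustrative Python model of the behaviour described above, not the actual CloudStack implementation):

```python
# Sketch of the three list-sorting algorithms: 'static' returns the list
# unchanged, 'roundrobin' rotates it per agent, 'shuffle' randomises it.
import random

def apply_algorithm(servers, algorithm, agent_index=0):
    if algorithm == "static":
        return list(servers)                 # same list for every agent
    if algorithm == "roundrobin":
        i = agent_index % len(servers)       # rotate the list per agent
        return servers[i:] + servers[:i]
    if algorithm == "shuffle":
        shuffled = list(servers)
        random.shuffle(shuffled)             # random order per agent
        return shuffled
    raise ValueError(f"unknown algorithm: {algorithm}")

ms = ["M1", "M2", "M3"]
for host_index in range(4):                  # hosts H1..H4
    print(apply_algorithm(ms, "roundrobin", host_index))
# → ['M1','M2','M3'], ['M2','M3','M1'], ['M3','M1','M2'], ['M1','M2','M3']
```

This reproduces the H1-H4 distribution shown in the example later in the post.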

Any changes to these global configurations are dynamic and do not require restarting the management server.

There are three cases in which new lists are propagated to the agents:

  • Addition of a host
  • Connection or reconnection of an agent
  • A change on the ‘host’ or ‘’ configurations

Agents perspective

Agents receive the list of management servers, the algorithm and the check interval (if provided), and persist them in their file as:


The first management server on the list is considered the preferred host. If the check interval is greater than 0, it is persisted under the ‘’ key. When the interval is greater than 0 and the host the agent is connected to is not the preferred host, the agent will attempt to connect to the preferred host.
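The preferred-host check described above boils down to a simple rule, sketched here (assumed logic, not the actual agent code):

```python
# Sketch of the background preferred-host check: the first list entry is the
# preferred host; reconnect only when the interval is enabled (> 0) and the
# agent is currently connected elsewhere.
def should_reconnect(connected, server_list, check_interval):
    if check_interval <= 0 or not server_list:
        return False
    return connected != server_list[0]

print(should_reconnect("M2", ["M1", "M2", "M3"], 60))  # True → try M1
print(should_reconnect("M1", ["M1", "M2", "M3"], 60))  # False
```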

When a connection is established between an agent and a management server, the agent sends its list of management servers. The management server checks whether the agent’s list is up to date, and sends the updated list if it is outdated. This behaviour ensures that each agent gets the current version of the list of management servers even after a failure.


Assuming a test environment consisting of:

  • 3 management servers: M1, M2 and M3
  • 4 KVM hosts: H1, H2, H3 and H4

The ‘host’ global configuration should be set to ‘M1,M2,M3’

If the CloudStack administrator wants no load balancing between agents and management servers, they would set the ‘static’ algorithm in the ‘’ global configuration. Each agent receives the same list (M1,M2,M3) and will connect to the same management server.

If the CloudStack administrator wishes to balance connections between agents and management servers, the ’roundrobin’ algorithm is recommended. In this case:

  • H1 receives the list (M1, M2, M3)
  • H2 receives the list (M2, M3, M1)
  • H3 receives the list (M3, M1, M2)
  • H4 receives the list (M1, M2, M3)

There is also a ‘shuffle’ algorithm, in which the list is randomised before being sent to each agent. With this algorithm the CloudStack administrator has no control over the load balancing, so it is not recommended for production use at the moment.

Combined with the algorithm, the CloudStack administrator can also set the ‘’ global configuration to ‘X’. This ensures that every X seconds, each agent will check whether the management server it is connected to is the first element of its list (the preferred host). If there is a mismatch, the agent will attempt to connect to the preferred host.

About the author

Nicolas Vazquez is a Senior Software Engineer at ShapeBlue, the Cloud Specialists, and is a committer in the Apache CloudStack project. Nicolas spends his time designing and implementing features in Apache CloudStack.

What’s new in CloudStack 4.11?


Version 4.11 of Apache CloudStack has been released with some exciting new features and a long list of improvements and fixes. It includes more than 400 commits and 220 pull requests, and fixes more than 250 issues. This version has been worked on for 8 months and is the first of the 4.11 LTS releases, which will be supported until 1 July 2019.

We’ve been heavily involved in this release at ShapeBlue; our engineering team has contributed a number of the major new features and our own Rohit Yadav has been the 4.11 Release Manager.

As well as some really interesting new features, CloudStack 4.11 has significant performance and reliability improvements to the Virtual Router.

This is far from an exhaustive list, but here are the headline items that we think are most significant.

New Features and Improvements

  • Support for XenServer 7.1 and 7.2, and improved support for VMware 6.5.
  • Host-HA framework and HA-provider for KVM hosts with NFS as primary storage, and a new background polling task manager.
  • Secure agents communication: new certificate authority framework and a default built-in root CA provider.
  • New network type – L2.
  • CloudStack metrics exporter for Prometheus.
  • Cloudian Hyperstore connector for CloudStack.
  • Annotation feature for CloudStack entities such as hosts.
  • Separation of volume snapshot creation on primary storage and backing operation on secondary storage.
  • Limit admin access from specified CIDRs.
  • Expansion of Management IP Range.
  • Dedication of public IPs to SSVM and CPVM.
  • Support for separate subnet for SSVM and CPVM.
  • Bypass secondary storage template copy/transfer for KVM.
  • Support for multi-disk OVA template for VMware.
  • Storage overprovisioning for local storage.
  • LDAP mapping with domain scope, and mapping of LDAP group to an account.
  • Move user across accounts.
  • Support for “VSD managed” networks with Nuage Networks.
  • Extend config drive support for user data, metadata, and password (Nuage networks).
  • Nuage domain template selection per VPC and support for network migration.
  • Managed storage enhancements.
  • Support for watchdog timer to KVM Instances.
  • Support for Secondary IPv6 Addresses and Subnets.
  • IPv6 Prefix Delegation support in basic networking.
  • Ability to specify a MAC address while deploying a VM or adding a NIC to a VM.
  • VMware dvSwitch security policies configuration in network offering.
  • Allow more than 7 NICs to be added to a VMware VM.
  • Network rate usage for guest offering for VRs.
  • Usage metrics for VM snapshot on primary storage.
  • Enable Netscaler inline mode.
  • NCC integration in CloudStack.
  • The retirement of the Midonet network plugin.

UI Improvements

  • High precision of metrics percentage in the dashboard:
  • Event timeline – filter related events:

  • Navigation improvements between related entities:
  • Bulk operation support for stopping and destroying VMs (note: minor known issue where manual refresh required afterwards):
  • List view improvements and additional columns with state icon:

Structural Improvements

  • Embedded Jetty and improved CloudStack management server configuration.
  • Improved support for Java 8 for building artifacts/modules, packaging, and in the systemvm template.
  • New Debian 9 based systemvm template:
    • Patches system VMs without reboot, reducing VR/system VM startup time to a few tens of seconds.
    • Faster console proxy startup and service availability.
    • Improved support for redundant virtual routers, conntrackd and keepalived.
    • Improved strongswan provided VPN (s2s and remote access).
    • Packer based systemvm template generation and reduced disk size.
    • Several optimizations and improvements.

Documentation and Downloads

The official installation, administration and API documentation can be found below: 

The release notes can be found at: 

The instruction and links to use ShapeBlue provided (noredist) packages repository can be found at: 

CloudStack usage service deep dive


CloudStack usage is a complementary service which tracks end-user consumption of CloudStack resources and summarises it in a separate database for reporting or billing. The usage database can be queried directly or through the CloudStack API, or it can be integrated into external billing or reporting systems.

For background information on the usage service please refer to the CloudStack documentation set:

In this blog post we will go a step further and deep dive into how the usage service works, how you can run usage reports from the database either directly or through the API, and also how to troubleshoot this.

Please note – in this blog post we will be discussing the underlying database structure for the CloudStack management and usage services. Whilst these have separate databases they do in some cases share table names – hence please note the databases referenced throughout – e.g. cloud.usage_event versus cloud_usage.usage_event, etc.



As per the official CloudStack documentation the usage service is simply installed and started. In CentOS/RHEL this is done as follows:

# yum install cloudstack-usage
# chkconfig cloudstack-usage on
# service cloudstack-usage start

whilst on a Debian/Ubuntu server:

# apt-get install cloudstack-usage
# update-rc.d cloudstack-usage defaults
# service cloudstack-usage start

Once configured, the usage service will use the same MySQL connection details as the main CloudStack management service. These are automatically added when the management service is configured with the “cloudstack-setup-databases” script. The usage service installation simply adds symbolic links to the same files used by cloudstack-management:

# ls -l /etc/cloudstack/usage/
total 4
lrwxrwxrwx. 1 root root   40 Sep  8 08:18 -> /etc/cloudstack/management/
lrwxrwxrwx. 1 root root   30 Sep  8 08:18 key -> /etc/cloudstack/management/key
-rw-r--r--. 1 root root 2968 Jul 12 10:36 log4j-cloud.xml

Please note that whilst the cloudstack-usage and cloudstack-management services share the same configuration file, it still contains individual settings for each service:

# grep -i usage /etc/cloudstack/usage/
# usage database tuning parameters
# usage database settings
db.usage.failOverReadOnly=false DB host IP address)
db.usage.password=ENC(Encrypted password)
#usage Database

Note the above settings would need to be changed if:

  • the usage DB is installed on a different MySQL server than the main CloudStack database
  • the usage database uses a different set of login credentials

Also note that the passwords in the file above are encrypted using the method specified during the “cloudstack-setup-databases” script run – hence this also uses the referenced “key” file as shown in the above folder listing.

Application settings

Once installed, the usage service is configured with the following global settings in CloudStack:

  • enable.usage.server:
    • Switches usage service on/off
    • true|false
  • usage.aggregation.timezone:
    • Timezone used for usage aggregation.
    • Refer to for formatting.
    • Defaults to “GMT”.
  • usage.execution.timezone:
    • Timezone for usage job execution.
    • Refer to for formatting.
  • usage.sanity.check.interval:
    • Interval (in days) to check sanity of usage data.
  •
    • Set the value to true if snapshot usage needs to consider virtual size, else physical size is considered.
    • true|false – defaults to false.
  • usage.stats.job.aggregation.range:
    • The range of time for aggregating the user statistics specified in minutes (e.g. 1440 for daily, 60 for hourly. Default is 60 minutes).
    • Please note this setting would be changed in a chargeback situation where VM resources are charged on an hourly/daily/monthly basis.
  • usage.stats.job.exec.time:
    • The time at which the usage statistics aggregation job will run as an HH:MM time, e.g. 00:30 to run at 12:30am.
    • Default is 00:15.
    • Please note this time follows the setting in usage.execution.timezone above.

Please note – if any of these settings are updated then only the cloudstack-usage service needs to be restarted (there is no need to restart cloudstack-management).

Usage types

To track the resources utilised in CloudStack, every API call where a resource is created, destroyed, stopped, started, requested or released is tracked in the cloud.usage_event table. This table has entries for every event since the CloudStack instance was created, hence it may grow quite large.

During processing, every event in this table is assigned a usage type. The usage types are listed in the CloudStack documentation, or they can simply be queried using the CloudStack “listUsageTypes” API call:

# cloudmonkey list usagetypes
count = 19
| usagetypeid | description                             |
|  1          |  Running Vm Usage                       |
|  2          |  Allocated Vm Usage                     |
|  3          |  IP Address Usage                       |
|  4          |  Network Usage (Bytes Sent)             |
|  5          |  Network Usage (Bytes Received)         |
|  6          |  Volume Usage                           |
|  7          |  Template Usage                         |
|  8          |  ISO Usage                              |
|  9          |  Snapshot Usage                         |
| 10          |  Security Group Usage                   |
| 11          |  Load Balancer Usage                    |
| 12          |  Port Forwarding Usage                  |
| 13          |  Network Offering Usage                 |
| 14          |  VPN users usage                        |
| 21          |  VM Disk usage(I/O Read)                |
| 22          |  VM Disk usage(I/O Write)               |
| 23          |  VM Disk usage(Bytes Read)              |
| 24          |  VM Disk usage(Bytes Write)             |
| 25          |  VM Snapshot storage usage              |

Please note these usage types are calculated depending on the nature of the resource used, e.g.:

  • “Running VM usage” will simply count the hours a single VM instance is used.
  • “Volume usage” will however track both the size of each volume in addition to the time utilised.
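The two calculation styles can be sketched as follows (assumed formulas that match the aggregated examples later in this post, not CloudStack source code):

```python
# Sketch: running-VM usage (type 1) is plain hours; volume usage (type 6)
# weights the hours by the allocated size, giving GB-hours.
def running_vm_hours(seconds_running: float) -> float:
    return seconds_running / 3600.0

def volume_gb_hours(seconds_allocated: float, size_bytes: int) -> float:
    gb = size_bytes / (1024 ** 3)
    return (seconds_allocated / 3600.0) * gb

print(round(volume_gb_hours(3600, 21474836480), 2))  # a 20 GB disk for one hour → 20.0
```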

Process flow


From a high-level point of view, the usage service processes data already generated by the CloudStack management service, copies it to the cloud_usage database, and then processes and aggregates the data into the cloud_usage.cloud_usage table:



Using a running VM instance as an example, the data process flow is as follows.

Usage_event table entries

CloudStack management writes all events to the cloud.usage_event table. This happens whether the cloudstack-usage service is running or not.

In this example we will track the VM with instance ID 17. The resource tracked – be it a VM, a volume, a port forwarding rule, etc. – is listed in the usage_event table as “resource_id”, which points to the main ID field in the vm_instance, volumes, etc. tables.

SELECT * FROM cloud.usage_event WHERE type LIKE '%VM%' AND resource_id = 17;

id  | type       | account_id | created             | zone_id | resource_id | resource_name | offering_id | template_id | size | resource_type | processed | virtual_size
68  | VM.CREATE  | 6 | 2017-09-08 11:14:31 | 1 | 17 | bbannervm12 | 17 | 5 | NULL | XenServer | 0 | NULL
70  | VM.START   | 6 | 2017-09-08 11:14:41 | 1 | 17 | bbannervm12 | 17 | 5 | NULL | XenServer | 0 | NULL
123 | VM.STOP    | 6 | 2017-09-26 13:44:48 | 1 | 17 | bbannervm12 | 17 | 5 | NULL | XenServer | 0 | NULL
125 | VM.DESTROY | 6 | 2017-09-26 13:45:00 | 1 | 17 | bbannervm12 | 17 | 5 | NULL | XenServer | 0 | NULL

Please note: many resources will obviously still be in use – i.e. they will not yet have a destroy/release entry. In this case the usage service considers the end date to be open, i.e. all calculations run up until today.

Usage_event copy

When the usage job runs (at “usage.stats.job.exec.time”) it first copies all new entries since the last processing time from the cloud.usage_event table to the cloud_usage.usage_event table.

The only difference between the two tables is the “processed” column – in the cloud database this is always set to 0, however once a table entry is processed in the cloud_usage database this field is updated to 1.

In comparison – the entries in the cloud database:

SELECT * FROM cloud.usage_event WHERE id > 130;

id  | type                    | account_id | created             | zone_id | resource_id | resource_name    | offering_id | template_id | size       | resource_type  | processed | virtual_size
131 | VOLUME.CREATE           | 6 | 2017-09-26 13:45:44 | 1 | 31 | bbannerdata3     | 6    | NULL | 2147483648 | NULL           | 0 | NULL
132 | NET.IPASSIGN            | 6 | 2017-09-26 13:46:05 | 1 | 17 | 10.1.34.77       | NULL | 0    | 0          | VirtualNetwork | 0 | NULL
133 | VM.STOP                 | 8 | 2017-09-28 10:31:44 | 1 | 23 | secretprojectvm1 | 17   | 5    | NULL       | XenServer      | 0 | NULL
134 | NETWORK.OFFERING.REMOVE | 8 | 2017-09-28 10:31:44 | 1 | 23 | 4                | 18   | NULL | 0          | NULL           | 0 | NULL

Compared to the same entries in cloud_usage:

SELECT * FROM cloud_usage.usage_event WHERE id > 130;

id  | type                    | account_id | created             | zone_id | resource_id | resource_name    | offering_id | template_id | size       | resource_type  | processed | virtual_size
131 | VOLUME.CREATE           | 6 | 2017-09-26 13:45:44 | 1 | 31 | bbannerdata3     | 6    | NULL | 2147483648 | NULL           | 1 | NULL
132 | NET.IPASSIGN            | 6 | 2017-09-26 13:46:05 | 1 | 17 | 10.1.34.77       | NULL | 0    | 0          | VirtualNetwork | 1 | NULL
133 | VM.STOP                 | 8 | 2017-09-28 10:31:44 | 1 | 23 | secretprojectvm1 | 17   | 5    | NULL       | XenServer      | 1 | NULL
134 | NETWORK.OFFERING.REMOVE | 8 | 2017-09-28 10:31:44 | 1 | 23 | 4                | 18   | NULL | 0          | NULL           | 1 | NULL
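The copy-and-mark behaviour shown in the two listings above can be modelled in a few lines (a toy sketch with assumed semantics, operating on plain dictionaries rather than MySQL rows):

```python
# Sketch: rows newer than the last copied id are copied verbatim from the
# cloud side (where processed stays 0), and the cloud_usage copies are
# flagged processed=1 once they have been aggregated.
def run_copy_job(source_rows, target_rows, last_copied_id):
    """Copy new usage_event rows from cloud to cloud_usage."""
    new_rows = [dict(r) for r in source_rows if r["id"] > last_copied_id]
    target_rows.extend(new_rows)
    return max((r["id"] for r in new_rows), default=last_copied_id)

def mark_processed(target_rows):
    """After aggregation, flip the flag on the cloud_usage copies."""
    for r in target_rows:
        r["processed"] = 1

cloud = [{"id": 131, "type": "VOLUME.CREATE", "processed": 0}]
cloud_usage = []
run_copy_job(cloud, cloud_usage, 130)
mark_processed(cloud_usage)
print(cloud[0]["processed"], cloud_usage[0]["processed"])   # 0 1
```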

Account copy

As part of this copy job the cloudstack-usage service will also make a copy of some of the columns in the cloud.account table, so that ownership of resources can be easily established during processing.

Usage summary and helper tables

In the first usage aggregation step, all usage data per account and per usage type is summarised in helper tables. Continuing the example above, the VM CREATE+DESTROY events as well as the VM START+STOP events are summarised in the “usage_vm_instance” table:


usage_type | zone_id | account_id | vm_instance_id | vm_name | service_offering_id | template_id | hypervisor_type | start_date | end_date | …
1 | 1 | 6 | 17 | bbannervm12 | 17 | 5 | XenServer | 2017-09-08 11:14:41 | 2017-09-26 13:44:48 | NULL | NULL | NULL
2 | 1 | 6 | 17 | bbannervm12 | 17 | 5 | XenServer | 2017-09-08 11:14:31 | 2017-09-26 13:45:00 | NULL | NULL | NULL

Note the helper table has now summarised the data with the usage type mentioned above – and the start/end dates are contained in the same database row.

Please note – if a resource is still in use then the end date simply isn’t populated, i.e. all calculations will work on rolling end date of today.

If we now also compare the volume used by VM instance ID 17 we find this in the cloud_usage.usage_volume helper table:

SELECT cloud_usage.usage_volume.*
   FROM cloud_usage.usage_volume
   JOIN cloud.volumes ON ( =
   WHERE cloud.volumes.instance_id = 17;

id | zone_id | account_id | domain_id | disk_offering_id | template_id | size | created | deleted
18 | 1 | 6 | 2 | NULL | 5 | 21474836480 | 2017-09-08 11:14:31 | 2017-09-26 13:45:00

As the database selects above show, each helper table contains only the information pertinent to that specific usage type: cloud_usage.usage_vm_instance contains information about the VM service offering, template and hypervisor type, while cloud_usage.usage_volume contains information about the disk offering ID, template ID and size.

If a usage type for a resource has been started/stopped or requested/released multiple times then each period of use will be listed in the helper tables:


usage_type | zone_id | account_id | vm_instance_id | vm_name | service_offering_id | template_id | hypervisor_type | start_date | end_date | …
1 | 1 | 6 | 12 | bbannervm2 | 17 | 5 | XenServer | 2017-09-08 09:30:37 | 2017-09-08 09:30:49 | NULL | NULL | NULL
1 | 1 | 6 | 12 | bbannervm2 | 17 | 5 | XenServer | 2017-09-08 11:14:03 | NULL | NULL | NULL | NULL
2 | 1 | 6 | 12 | bbannervm2 | 17 | 5 | XenServer | 2017-09-08 09:30:20 | NULL | NULL | NULL | NULL

Usage data aggregation

Once all helper tables have been populated, the usage service creates time-aggregated database entries in the cloud_usage.cloud_usage table. In simple terms, this process:

  1. Analyses all entries in the helper tables.
  2. Splits up this data based on “usage.stats.job.aggregation.range” to create individual usage timeblocks.
  3. Repeats this process for all accounts and for all resources.

So – looking at the VM with ID=17 analysed above:

  • This had a running start date of 2017-09-08 11:14:41, an end date of 2017-09-26 13:44:48.
  • The usage service is set up with usage.stats.job.aggregation.range=1440, i.e. 24 hours.
  • The usage service will now create entries in the cloud_usage.cloud_usage table for every full and partial 24 hour period this VM was running.

SELECT * FROM cloud_usage.cloud_usage WHERE usage_id = 17 AND usage_type = 1;

id   | zone_id | account_id | domain_id | description | usage_display | usage_type | raw_usage | vm_instance_id | vm_name | offering_id | template_id | usage_id | type | size | network_id | start_date | end_date | …
64   | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 12.755278 Hrs | 1 | 12.755278633666992 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-08 00:00:00 | 2017-09-08 23:59:59 | NULL | NULL | NULL | NULL | 0
146  | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 24 Hrs | 1 | 24 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-09 00:00:00 | 2017-09-09 23:59:59 | NULL | NULL | NULL | NULL | 0
221  | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 24 Hrs | 1 | 24 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-10 00:00:00 | 2017-09-10 23:59:59 | NULL | NULL | NULL | NULL | 0
1271 | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 24 Hrs | 1 | 24 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-24 00:00:00 | 2017-09-24 23:59:59 | NULL | NULL | NULL | NULL | 0
1346 | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 24 Hrs | 1 | 24 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-25 00:00:00 | 2017-09-25 23:59:59 | NULL | NULL | NULL | NULL | 0
1427 | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 13.746667 Hrs | 1 | 13.74666690826416 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-26 00:00:00 | 2017-09-26 23:59:59 | NULL | NULL | NULL | NULL | 0

Since all of these entries are split into specific dates it is now relatively straightforward to run a report to capture all resource usage for an account over a specific time period, e.g. if a monthly bill is required.

Querying usage data through the API

The usage records can also be queried through the API using the “listUsageRecords” API call. This uses similar syntax to the above, but there are some differences:

  • The API call requires start and end dates; these are in a “yyyy-MM-dd HH:mm:ss” or simply a “yyyy-MM-dd” format.
  • The usage type is the same as above, e.g. type=1 for running VMs.
  • Usage ID is however the UUID attached to the resource in question, e.g. in the following example VM ID 17 actually has UUID 4358f436-bc9b-4793-b1be-95fa9b074fd5 in the vm_instance table.
  • The API call can also be filtered for account/accountid/domain.

More information on the syntax can be found in .

The following API query will list the first three days’ worth of usage data listed in the table above:

# cloudmonkey list usagerecords type=1 startdate=2017-09-09 enddate=2017-09-10 usageid=4358f436-bc9b-4793-b1be-95fa9b074fd5
count = 3
| startdate                   | account | domainid                             | enddate                     | description                                                  | name        | virtualmachineid                     | offeringid                           | usagetype | domain     | zoneid                               | rawusage | templateid                           | usage         | usageid                              | type      | accountid                            |
| 2017-09-08'T'00:00:00+00:00 | bbanner | f3501b29-01f7-44ce-a266-9e3f12c17394 | 2017-09-08'T'23:59:59+00:00 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | bbannervm12 | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | 60d9aaf1-7ff7-472e-b29f-6768d0cb5702 | 1         | Subdomain1 | d4b9d32e-d779-48b8-814d-d7847d55a684 | 12.755278| 47dd8c98-946e-11e7-b419-0666ae010714 | 12.755278 Hrs | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | XenServer | 8c2d592f-78e1-4e92-a910-1e4b865240cf |
| 2017-09-09'T'00:00:00+00:00 | bbanner | f3501b29-01f7-44ce-a266-9e3f12c17394 | 2017-09-09'T'23:59:59+00:00 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | bbannervm12 | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | 60d9aaf1-7ff7-472e-b29f-6768d0cb5702 | 1         | Subdomain1 | d4b9d32e-d779-48b8-814d-d7847d55a684 | 24       | 47dd8c98-946e-11e7-b419-0666ae010714 | 24 Hrs        | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | XenServer | 8c2d592f-78e1-4e92-a910-1e4b865240cf |
| 2017-09-10'T'00:00:00+00:00 | bbanner | f3501b29-01f7-44ce-a266-9e3f12c17394 | 2017-09-10'T'23:59:59+00:00 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | bbannervm12 | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | 60d9aaf1-7ff7-472e-b29f-6768d0cb5702 | 1         | Subdomain1 | d4b9d32e-d779-48b8-814d-d7847d55a684 | 24       | 47dd8c98-946e-11e7-b419-0666ae010714 | 24 Hrs        | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | XenServer | 8c2d592f-78e1-4e92-a910-1e4b865240cf |

Analysing and reporting on usage data

The usage data can be analysed in any reporting tool – from the various CloudStack billing platforms, to enterprise billing systems as well as simpler tools like Excel. Since the cloud_usage.cloud_usage data is fully aggregated into time utilised blocks, it is now just a question of summarising data based on usage type, accounts, service offerings, etc.

The following SQL queries are provided as examples only – in a real use case they will most likely need to be changed and refined to the specific reporting requirements.

Running VMs

To find usage data for all VMs running during the month of September we search for usage type=1 and group by VM instance. For each VM instance we summarise how many hours the VM has been running – however, in a real billing scenario this would most likely also be broken down into e.g. how many hours of VM usage have been utilised per VM service offering.

SELECT account_id, account.account_name, vm_instance_id, vm_name,
   SUM(raw_usage) as VMRunHours
   FROM cloud_usage.cloud_usage
   LEFT JOIN cloud_usage.account on (cloud_usage.account_id =
   WHERE start_date LIKE '2017-09%'
   AND usage_type = 1
   GROUP BY account_id, vm_instance_id
   ORDER BY account_id ASC, vm_instance_id ASC;

Network utilisation

The following will summarise network usage for sent (usage type=4) and received (usage type=5) traffic on a per-account basis, again listing for the month of September.

For network utilisation the usage is simply summarised as total Bytes sent or received:

SELECT account_id, account.account_name, usage_type,
   SUM(raw_usage) as TotalBytes
   FROM cloud_usage.cloud_usage
   LEFT JOIN cloud_usage.account on (cloud_usage.account_id =
   WHERE start_date LIKE '2017-09%'
   AND usage_type in (4,5)
   GROUP BY account_id, usage_type
   ORDER BY account_id ASC;

Volume utilisation

For volume or general storage utilisation (this applies to snapshots as well) the usage is calculated as storage hours – e.g. GbHours. In this example we again summarise for all volumes (usage type=6) on a per-account and per-disk basis during the month of September. Please note in this case we have to do multiple joins (or nested WHERE statements) to look up volume IDs, VM names, etc.

SELECT account_id, account.account_name,
   cloud_usage.cloud_usage.usage_id, as Instance_Name, as Volume_Name,
   cloud_usage.cloud_usage.size/(1024*1024*1024) as DiskSizeGb,
   SUM(cloud_usage.cloud_usage.raw_usage) as TotalHours,
   SUM(cloud_usage.cloud_usage.raw_usage*cloud_usage.cloud_usage.size/(1024*1024*1024)) as GbHours
   FROM cloud_usage.cloud_usage
   LEFT JOIN cloud_usage.account on (cloud_usage.account_id =
   LEFT JOIN cloud.volumes on (cloud_usage.usage_id =
   LEFT JOIN cloud.vm_instance on (cloud.volumes.instance_id =
   WHERE start_date LIKE '2017-09%' AND usage_type = 6
   GROUP BY account_id, usage_id
   ORDER BY account_id ASC, usage_id ASC;



IP addresses, port forwarding rules and VPN users

For other usage types where – similar to VM running hours – we simply report on the total hours utilised, we again summarise the raw_usage; since the description column is clear enough we don’t need to look elsewhere for this information. In the following example we report on IP address usage (usage type=3), port forwarding rules (12) and VPN users (14):

SELECT account_id, account.account_name, usage_type, usage_id, description,
   SUM(cloud_usage.cloud_usage.raw_usage) as TotalHours
   FROM cloud_usage.cloud_usage
   LEFT JOIN cloud_usage.account on (cloud_usage.account_id =
   WHERE start_date LIKE '2017-09%' AND usage_type in (3,12,14)
   GROUP BY account_id, usage_type, usage_id
   ORDER BY account_id ASC, usage_id ASC;


account_id | account_name | usage_type | usage_id | description | TotalHours
6 | bbanner | 14 | 1 | VPN User: bbannervpn1, Id: 1 usage time | 542.4766664505005
6 | bbanner | 14 | 2 | VPN User: brucesdogvpn1, Id: 2 usage time | 1.7355557680130005
6 | bbanner | 14 | 3 | VPN User: bruceswifevpn1, Id: 3 usage time | 540.7405557632446
6 | bbanner | 14 | 4 | VPN User: stanleevpn1, Id: 4 usage time | 540.7180547714233
6 | bbanner | 12 | 9 | Port Forwarding Rule: 9 usage time | 1.6469446420669556


Service management

As described earlier in this blog post, the usage job will run at the time specified in the usage.stats.job.exec.time global setting.

Once the job has run it will update its own internal database with the run time and the start/end times processed:

SELECT * FROM cloud_usage.usage_job;


id | host         | … | start_date          | end_date            | success | heartbeat
1  | acshostname/ | … | 2017-09-08 00:00:00 | 2017-09-08 23:59:59 | 1 | 2017-09-09 00:14:53
2  | acshostname/ | … | 2017-09-09 00:00:00 | 2017-09-09 23:59:59 | 1 | 2017-09-10 00:14:53
3  | acshostname/ | … | 2017-09-10 00:00:00 | 2017-09-10 23:59:59 | 1 | 2017-09-11 00:14:53
4  | acshostname/ | … | 2017-09-11 00:00:00 | 2017-09-11 23:59:59 | 1 | 2017-09-12 00:14:53
5  | acshostname/ | … | 2017-09-12 00:00:00 | 2017-09-12 23:59:59 | 1 | 2017-09-13 00:14:53

A couple of things to note on this list:

  • Start_millis and end_millis simply list the epoch timestamps of start_date and end_date. The epoch time is used by the usage service to determine the cloud_usage.cloud_usage entries.
  • Exec_time lists how long the usage job ran for. This is useful in cases where the usage job processing time is longer than 24 hours – i.e. where usage job schedules may start overlapping.
  • The success field is set to 1 for success, 0 for failure.
  • Heartbeat lists when the job last ran.
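The start_millis/end_millis relationship is simply an epoch conversion, e.g. (illustrative sketch; assumes the UTC timestamps shown above):

```python
# Sketch: start_millis/end_millis are the epoch-millisecond form of
# start_date/end_date.
from datetime import datetime, timezone

def to_epoch_millis(dt: datetime) -> int:
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)

print(to_epoch_millis(datetime(2017, 9, 8, 0, 0, 0)))   # 1504828800000
```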

When the cloudstack-usage service is restarted this will run checks against the usage_jobs table to determine:

  • If the last scheduled job was ran. If this wasn’t done the job is ran again, i.e. a service startup will run a single missed job.
  • Thereafter the usage job will run at its normal scheduled time.
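The single-missed-job catch-up behaviour can be sketched roughly as follows (a simplified illustration in Python; the function and parameter names are ours, not the actual cloudstack-usage code):

```python
from datetime import datetime, timedelta

def missed_job_window(last_end_date, now, aggregation=timedelta(days=1)):
    """Return the (start, end) window of a single missed job, or None.

    Sketch of the startup check only: if at least one full aggregation
    period has elapsed since the last processed end_date, one catch-up
    job is run for the next period.
    """
    next_start = last_end_date + timedelta(seconds=1)
    next_end = next_start + aggregation - timedelta(seconds=1)
    if now >= next_end:
        return (next_start, next_end)
    return None  # nothing missed; wait for the normal schedule

# Service restarted on 2017-09-14; last processed window ended 2017-09-12 23:59:59
window = missed_job_window(datetime(2017, 9, 12, 23, 59, 59),
                           datetime(2017, 9, 14, 8, 0, 0))
print(window)  # one missed daily window for 2017-09-13
```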

Usage troubleshooting – general advice

Since this blog post covers topics around adding/updating/removing entries in the cloud and cloud_usage databases, we always advise CloudStack users to take MySQL dumps of both databases before doing any work – whether this is done directly in MySQL or via the usage API calls.

Database inconsistencies

Under certain circumstances (e.g. if the cloudstack-management service crashes) the cloud.usage_event table may have inconsistent entries, e.g.:

  • STOP entries without a START entry, or DESTROY entries without a CREATE.
  • Double entries – i.e. a VM has two START entries.

The usage logs will show where these failures occur. The fix for these issues is to add/delete entries as required in the cloud.usage_event table, e.g. add a VM.START with date stamp if missing and so on.
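A rough sketch of how such inconsistencies could be spotted before fixing them by hand (illustrative Python, not a CloudStack tool; the event tuples stand in for rows from cloud.usage_event):

```python
from collections import defaultdict

def find_inconsistencies(events):
    """Flag the two classes of usage_event problems described above.

    `events` is a list of (resource_id, event_type) tuples in
    chronological order - a stand-in for cloud.usage_event rows.
    """
    problems = []
    running = defaultdict(bool)  # resource_id -> currently "started"?
    for resource_id, event in events:
        if event == "VM.START":
            if running[resource_id]:
                problems.append((resource_id, "double VM.START"))
            running[resource_id] = True
        elif event == "VM.STOP":
            if not running[resource_id]:
                problems.append((resource_id, "VM.STOP without VM.START"))
            running[resource_id] = False
    return problems

events = [(1, "VM.START"), (1, "VM.START"),  # double START for VM 1
          (2, "VM.STOP")]                    # STOP with no START for VM 2
print(find_inconsistencies(events))
```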

Usage service logs

The usage service writes all logs to /var/log/cloudstack/usage/usage.log. These logs are relatively verbose and will outline all actions performed during the usage job:

DEBUG [usage.parser.IPAddressUsageParser] (Usage-Job-1:null) (logid:) Parsing IP Address usage for account: 2
DEBUG [usage.parser.IPAddressUsageParser] (Usage-Job-1:null) (logid:) Total usage time 86400000ms
DEBUG [usage.parser.IPAddressUsageParser] (Usage-Job-1:null) (logid:) Creating IP usage record with id: 3, usage: 24, startDate: Tue Oct 10 00:00:00 UTC 2017, endDate: Tue Oct 10 23:59:59 UTC 2017, for account: 2
DEBUG [usage.parser.VPNUserUsageParser] (Usage-Job-1:null) (logid:) Parsing all VPN user usage events for account: 2
DEBUG [usage.parser.VPNUserUsageParser] (Usage-Job-1:null) (logid:) No VPN user usage events for this period
DEBUG [usage.parser.VMSnapshotUsageParser] (Usage-Job-1:null) (logid:) Parsing all VmSnapshot volume usage events for account: 2
DEBUG [usage.parser.VMSnapshotUsageParser] (Usage-Job-1:null) (logid:) No VM snapshot usage events for this period
DEBUG [usage.parser.VMInstanceUsageParser] (Usage-Job-1:null) (logid:) Parsing all VMInstance usage events for account: 3
DEBUG [usage.parser.NetworkUsageParser] (Usage-Job-1:null) (logid:) Parsing all Network usage events for account: 3
DEBUG [usage.parser.VmDiskUsageParser] (Usage-Job-1:null) (logid:) Parsing all Vm Disk usage events for account: 3
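The 86400000 ms figure in the log above is one full day in milliseconds, which is where the "usage: 24" figure in the IP usage record comes from:

```python
# One full day of IP address usage, as reported by IPAddressUsageParser
total_usage_ms = 86_400_000

# Usage is recorded in hours: milliseconds / (3600 s * 1000 ms)
usage_hours = total_usage_ms / 3_600_000
print(usage_hours)  # 24.0
```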

Housekeeping of cloud_usage table

To carry out housekeeping of the cloud_usage.cloud_usage table, the “RemoveRawUsageRecords” API call can be used to delete all usage entries older than a certain number of days. Note – since the cloud_usage table only contains completed, parsed entries, deleting from this table will not lead to inconsistencies – it will just cut down on the number of usage records being reported on.

More information can be found in the CloudStack API documentation.

The following example deletes all usage records older than 5 days:

# cloudmonkey removeRawUsageRecords interval=5
success = true

Regenerating usage data

The CloudStack API also has a call for regenerating usage records – generateUsageRecords. This can be utilised to rerun the usage job in case of job failure. More information can be found in the CloudStack documentation.

Please note the comment on the above documentation page: “This will generate records only if there are any records to be generated, i.e. if the scheduled usage job was not run or failed”. In other words, this API call should not be made ad-hoc apart from in this specific situation.

# cloudmonkey generateUsageRecords startdate=2017-09-01 enddate=2017-09-30
success = true

Quota service

Anyone looking through the cloud_usage database will notice a number of quota_* tables. These are not directly linked to the usage service itself; rather, they are consumed by the Quota service. This service was created to monitor usage of CloudStack resources based on a per-account credit limit and a per-resource credit cost.

For more information on the Quota service please refer to the official CloudStack documentation and the CloudStack wiki.


The CloudStack usage service can seem complicated for someone just getting started with it. We hope this blog post has managed to explain the background processes and how to get useful data out of the service.

We always value feedback – so if you have any comments or questions around this blog post please feel free to get in touch with the ShapeBlue team.

About The Author

Dag Sonstebo is a Cloud Architect at ShapeBlue, The Cloud Specialists. Dag spends his time designing, implementing and automating IaaS solutions based around Apache CloudStack.

CloudStack CA Framework



The CloudStack management server listens by default on port 8250 for agents, and this is secured by one-way SSL authentication using the management server’s self-generated server certificates. While this encrypts the connection, it does not authenticate and validate the connecting agent (client). Upcoming features such as support for container/application cluster services require certificate management, and the emerging common theme is that CloudStack needs an internal certificate authority (CA) that can provide and ensure security and authenticity of client-server connections, and issue, revoke and provision certificates.


To solve these problems, we designed and implemented a new pluggable CA framework with a default self-signed root CA provider plugin, that makes CloudStack a root CA. Initial support is available for securing KVM hosts and systemvm agents, along with communication between multiple management servers. The feature also provides new APIs for issuance, revocation, use, and provision of certificates. For more details, here is the functional specification of the feature.

The new CA framework and root CA provider plugin for CloudStack was accepted by the community recently, and will be available in CloudStack 4.11 (to be released in the near future).

How does it work?

The CA framework injects itself into CloudStack’s server and client components, and provides separation of independent policy enforcement and mechanism implementation. The various APIs for issuance, revocation, and provision of certificates plug into the mechanism implementation provided by a CA provider plugin. In addition, the feature supports automatic renewal of expiring certificates on an agent or host, and will alert admins if auto-renewal is disabled or something goes wrong.

The feature ships with a built-in default root CA provider plugin that acts as a self-signed root CA authority, and issues certificates signed by its self-generated and signed CA certificate. It also allows developers to write their own CA provider plugin. If the configured CA provider plugin supports sharing of its CA certificate, a button will appear on the UI to download the CA certificate that can be imported to one’s browser, host, etc.

OK, what happens after we upgrade?

After upgrading CloudStack to a version which has this feature (e.g. 4.11), there will be no visible change and no additional steps are required. The root CA provider plugin will be configured and used by default and the global setting ca.plugin.root.auth.strictness will be set to false to mimic the legacy behaviour of one-way SSL authentication during handshake.

Post-upgrade, the CA framework will set up additional security (by means of keystore and certificates) on new KVM hosts and SystemVMs. If CloudStack admins want to enforce stricter security, they can upgrade and onboard all existing KVM and SystemVM agents, use the provisionCertificate API, set the global setting ca.plugin.root.auth.strictness to true (new CloudStack installations will have this setting set to true by default), and finally restart the management server(s). The SystemVM agents and (KVM) hosts will be in Up and connected state once two-way SSL handshake has correctly verified and authenticated the client-server connections.

Here’s a link to the official CloudStack Admin Documentation for more details.

About the author

Rohit Yadav is a Software Architect at ShapeBlue, the Cloud Specialists, and is a committer and PMC member of Apache CloudStack. Rohit spends most of his time designing and implementing features in Apache CloudStack.

Host-HA for KVM Hosts in CloudStack



What is HA?

“High availability is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. ”  — Wikipedia

HA in CloudStack is currently supported for VMs only. In order to have it enabled, the service offering of the VM should be HA-enabled, otherwise the VMs will not be taken into consideration. There is no HA activity around hosts at this stage, so we don’t have a defence mechanism if a host goes down. All investigations are VM-centric and we’re unable to determine the health of the host or whether it is actually still running the VM. This may result in the VM-HA mechanism starting the same VM on a different host while the faulty host is still running it, which would result in corrupt VMs and disks. Such issues have been seen in large-scale KVM deployments.




The Solution

Such issues motivated us to figure out a long-term solution to this problem, and we identified that the root of all evil is the lack of a reliable fencing and recovery mechanism. A new investigation model had to be introduced in order to achieve this, simply because the VM-centric one wasn’t going to be sufficient. Of course, it needed to be easy to maintain for administrators.

Setting this as our destination point, we started defining our route to get there. The first thing that became obvious to us is that CloudStack was missing an OOBM tool to fence and recover hosts. OOBM is the ability to execute power cycle operations on a certain host. So – we developed the CloudStack OOBM plugin, which implements the industry standard IPMI 2.0 protocol, supported by most vendors. This way, when enabled per host, users are able to issue power commands such as On, Off, Reset, etc.
OOBM Feature Specification

Host-HA granular configuration: offers admins the ability to set explicit configuration at host/cluster/zone level. This way, in a large environment some hosts in a cluster can be HA-enabled and some not, depending on the setup and the specific hardware that is running.

Threshold-based investigator: the admin can set a specific threshold of failed investigations; only when it is exceeded will the host transition into a different state.

More accurate investigating: Host-HA uses both health checks and activity checks to make decisions on recovery and fencing actions. Once it has determined that the resource (host) is in a faulty state (health checks failed), it runs activity checks to figure out whether there is any disk activity from the VMs running on that host.
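The decision flow described above can be sketched roughly as follows (illustrative Python; the threshold values and function names are assumptions, the real ones come from the Host-HA settings):

```python
def next_action(failed_health_checks, failed_activity_checks,
                health_threshold=3, activity_threshold=3):
    """Illustrative sketch of the Host-HA investigator's decision.

    "failed activity checks" here means checks that found NO disk
    activity on the VMs of the suspect host.
    """
    if failed_health_checks < health_threshold:
        return "keep-monitoring"   # below threshold: no state transition
    if failed_activity_checks < activity_threshold:
        return "degraded"          # host unhealthy, but its VM disks are active
    return "recover"               # no VM disk activity: safe to power-cycle

print(next_action(1, 0))  # keep-monitoring
print(next_action(5, 1))  # degraded
print(next_action(5, 4))  # recover
```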

Host-HA Design

Host-HA design aims to offer a way to separate policy from mechanism, where individuals are free to use different sets of pluggable tools (HA providers and OOBM tools) while having the same policy applied. Administrators can set the thresholds in global settings and not worry about the mechanism which is going to enforce them. With the resource management service, CloudStack admins can manage lifecycle operations per resource and use a kill switch at zone/cluster/host level to disable HA policy enforcement. The framework itself is resource-type agnostic and can be extended to any other resources within CloudStack, such as load balancers.

HA providers are resource-specific and are responsible for executing the HA framework and enforcing the applied policy. For example, the KVM HA provider, as part of this feature, works with KVM hosts and carries out the HA-related activities.
A state machine implements event triggers and transitions of a specific HA resource, based on which the framework takes the required actions to bring it to the right physical state. For example, if a host passes the threshold for being in degraded state, the framework will try to recover it by issuing an OOBM restart task, which will reset the host’s power so that it eventually comes back up. Here’s a list of the states:

Available – the feature is Enabled and Host-HA is available
Suspect – there are health checks failing with the Host
Checking – activity checks are being performed
Degraded – host is passing the activity check ratio and is still providing service to the end user, but cannot be managed from CloudStack Management
Recovering – the Host-HA framework is trying to Recover the host by issuing OOBM job
Recovered – the Host-HA framework has recovered the Host successfully
Fencing – the Host-HA framework is trying to Fence the host by issuing OOBM job
Fenced – the Host-HA framework has fenced the Host successfully
Disabled –  feature is Disabled for the Host
Ineligible – feature is Enabled, but the host cannot be managed successfully by the Host-HA framework (possibly OOBM is not configured properly)
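A tiny sketch of the state machine, using a subset of the states listed above (the event names are our own illustrative labels, not CloudStack identifiers):

```python
# Illustrative subset of the Host-HA FSM; events are assumed labels.
TRANSITIONS = {
    ("Available", "health-check-failed"): "Suspect",
    ("Suspect", "run-activity-checks"): "Checking",
    ("Checking", "activity-detected"): "Degraded",
    ("Checking", "no-activity"): "Recovering",
    ("Recovering", "oobm-restart-ok"): "Recovered",
    ("Recovering", "retries-exhausted"): "Fencing",
    ("Fencing", "oobm-poweroff-ok"): "Fenced",
}

def step(state, event):
    # Unknown (state, event) pairs keep the current state
    return TRANSITIONS.get((state, event), state)

state = "Available"
for event in ("health-check-failed", "run-activity-checks",
              "no-activity", "oobm-restart-ok"):
    state = step(state, event)
print(state)  # Recovered
```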

The FSM transition diagram defines all possible transitions, along with the conditions required to move to the next state.

Host-HA on KVM host

Host-HA on KVM hosts is provided by the KVM HA provider. It uses the STONITH (Shoot The Other Node In The Head) fencing model, and provides a mechanism for activity checks on disks on shared NFS storage. How does it work? Within a cluster, neighbouring hosts are able to perform activity checks on the disks of VMs running on a faulty (health checks failed) host. The activity check verifies whether there is any actual activity on the VM disks while the host they are running on has been reported in bad health. If there is activity, the host stays in Degraded state; if there is not, the HA framework transitions it to Recovering state and tries to bring it back up. If it exceeds the threshold of recovery attempts, the framework fences the host by powering off the machine.

Please check out the FS for more technical details.

Find the pull request on the Apache CloudStack Public Repo

HOST-HA and VM-HA coordination

For KVM Host-HA to work effectively, it has to work in tandem with the existing VM-HA framework. The current CloudStack implementation focuses on VM-HA, as VMs are the first-class entities, while a host is considered a resource. CloudStack manages host states, and a rough mapping of CloudStack host states vs. KVM Host-HA states is as below:

VM-HA host state     KVM Host-HA host state
Up                   Available
Up (Investigating)   Suspect / Checking
Alert                Degraded
Disconnected         Recovering / Recovered / Fencing
Down                 Fenced

Host-HA improves on investigation by providing a new way of investigating VMs using VM disk activity. It also adds to the fencing capabilities by integrating with the OOBM feature.

In order for VM-HA to work correctly and in sync with Host-HA, it is important that the state of the host seen by the two is the same, as per the above table. The VM-HA model has been modified to query the Host-HA states to get the actual host state when the feature is enabled. It also makes sure VM-HA related activities are not started unless the host has been properly fenced.

About the author

Boris Stoyanov is Software Engineer in testing at ShapeBlue, The Cloud Specialists. Bobby spends his time testing features for the Apache CloudStack Community and for our ShapeBlue clients.

Dynamic Roles in CloudStack



Managing user roles has been a pain for a while, as the model of having a file that defines roles and their permissions can be hard to comprehend and use. Due to this, not many CloudStack users made any changes to the default hardcoded roles or further enhanced them. Therefore, ShapeBlue took the opportunity to rewrite the Role-Based Access Control (RBAC) unit into a Dynamic Roles model. The change allows the CloudStack Root Admin to create new roles with customised permissions from the CloudStack UI by allowing/denying specific APIs. It deprecates the old file-based approach and transfers all the rules into the DB. This is available in CloudStack 4.9.x and greater.

How does it work?

Dynamic RBAC introduces a new tab in the CloudStack Console called Roles. Root Admins by default are able to navigate there and Create / Update all roles. When creating a new role, the Root Admin is able to select the rules that apply to that role, and can define a list of APIs which they could allow or deny for the role. When the user (assigned with a specific role) issues an API request, the backend checks the requested API against configured rules for the assigned role, and the user will only be able to call the API if it’s allowed on the list. If denied or not listed it won’t be possible to call the API.

How to use it?

In this example, let’s assume we want to create a Root Admin that has read-only rights on everything but “Global Settings” in CloudStack.

The following rules configuration shows an example of this custom role, which is only able to view resources. The image below shows the rules tab of the custom role called “read-only”. Please observe that only “list*” APIs are allowed, meaning that a user with this role will not be able to delete/update anything within CloudStack, but only use the list APIs. Also note an additional rule denying any APIs related to configurations (*Configuration). Due to this, the user will not be able to see anything within “Global Settings”. The order of the rules list is also very important – the Dynamic Roles checker iterates the list top-down, so when configuring, it is best practice to shift “Deny” rules to the top. Rules can be reordered by simply dragging and dropping them. In this particular case, if the “Allow list*” rule were above the “Deny *Configuration” rule, the user would be able to see the Global Settings.
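The top-down, first-match evaluation can be sketched as follows (illustrative Python using fnmatch-style wildcards; the exact wildcard semantics CloudStack applies may differ, and the `*Configuration*` pattern is our illustrative spelling of the deny rule):

```python
from fnmatch import fnmatch

# Rules are evaluated top-down, first match wins; unmatched APIs are denied.
READ_ONLY_RULES = [
    ("deny", "*Configuration*"),   # hide Global Settings APIs
    ("allow", "list*"),            # allow all list APIs
]

def is_allowed(api, rules):
    for permission, pattern in rules:
        if fnmatch(api, pattern):
            return permission == "allow"
    return False  # default deny for anything not listed

print(is_allowed("listVirtualMachines", READ_ONLY_RULES))   # True
print(is_allowed("listConfigurations", READ_ONLY_RULES))    # False (deny rule first)
print(is_allowed("deployVirtualMachine", READ_ONLY_RULES))  # False (no match)
```

If the two rules were swapped, `listConfigurations` would match `list*` first and be allowed – exactly the ordering pitfall described above.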

When the user hits an API that is denied, they will be presented with the following generic error message:


OK, what happens if we upgrade to 4.9?

Dynamic Roles is available and enabled by default for all new installations from the CloudStack 4.9.x release onwards. If a user upgrades from an older version (to 4.9.x or greater), Dynamic Roles will be disabled by default and CloudStack will follow the old file-based way of handling RBAC. After the upgrade, existing deployments of CloudStack can be migrated to Dynamic RBAC by running a migration tool which is part of the 4.9 installation. The migration tool is located in the following directory on the management server: /usr/share/cloudstack-common/scripts/util/

When run, this tool will enable Dynamic RBAC, copy all existing hard-coded roles from the file, and create the same entities in the database following the Dynamic Roles data format. Finally, it will rename the original file as a backup.


python /usr/share/cloudstack-common/scripts/util/ -u cloud -p cloud -h localhost -p 3306 -f /etc/cloudstack/management/

Running this will output the following:
Apache CloudStack Role Permission Migration Tool

(c) Apache CloudStack Authors and the ASF, under the Apache License, Version 2.0 

Running this migration tool will remove any default-role permissions from cloud.role_permissions. Do you want to continue? [y/N]y

The file has been deprecated and moved to: /etc/cloudstack/management/

Static role permissions from have been migrated into the db

Dynamic role based API checker has been enabled! 

And you’re all set – no need to restart the management servers! There’s a new global setting introduced with this feature called ‘dynamic.apichecker.enabled’. If it is set to “true”, Dynamic Roles is enabled. If by any chance there is a failure during migration, the tool will roll back the procedure and revert to the old hardcoded way of handling RBAC.

After the upgrade the rules of Root Admin Role look like this:

…meaning all APIs are allowed.

Other roles have each individual API rule explicitly added (if available). See part of the Domain Admin rules for reference:

Here’s a link to the official CloudStack Admin documentation

About the author

Boris Stoyanov is Software Engineer in testing at ShapeBlue, The Cloud Specialists. Bobby spends his time testing features for the Apache CloudStack Community and for our ShapeBlue clients.

Granular Access Controls in CloudStack


An oft-cited limitation in Apache CloudStack is the lack of granular access controls.  Historically, when creating an account, there have been four built-in roles to choose from: Root Admin, Resource Admin, Domain Admin, and User.  Unfortunately, these built-in roles have been insufficient for the needs of many organizations, who have resorted to various workarounds.  Thankfully, this will change in the upcoming CloudStack 4.9 release with the addition of the Dynamic Role-Based API Access Checker feature.

Read more

CloudStack 4.7 Metrics View


CloudStack 4.7 (which is due in the coming weeks) will introduce a new metrics view feature throughout the familiar CloudStack interface. We built this functionality to help system architects and admins comprehend resource utilisation and drill into the data to find performance hotspots. Whilst metrics have always been available via the CloudStack API a lot of information hasn’t been readily available in the GUI, and the idea was to make resource usage more easily accessible on the various CloudStack menus without having to click through multiple pages.

Use cases

  • The metrics view allows easy resource usage monitoring from the GUI – especially when trying to narrow down high resource consumers among VM instances and disks, which in turn helps when making VM and storage migration decisions.
  • The metrics view also assists in longer term capacity planning of the CloudStack infrastructure, giving a better overview of the resource usage in zones and clusters.
  • In larger CloudStack estates the metrics view also gives a quick overview of the number of enabled / disabled zones, clusters and hosts.



The metrics view is implemented on the Zone, Cluster, Hosts, Instances, Primary Storage and Storage pages and is accessed via the new metrics button on the menu bar.



On the metrics view pages there are a number of features to make the function usable:

  • Similar to other CloudStack GUI pages all metrics pages have infinite scrolling – as you scroll down more data is loaded.
  • All columns are sortable by clicking on the column heading.
  • Where CloudStack works with warning and disable limits in the global settings any value exceeding these limits will be flagged amber / red.
  • Since the metrics view pages contain a lot of data each topic heading has a collapse button (<<) which collapses the topic column, thereby allowing more page space for the remaining columns. The collapsed column is expanded again by clicking the (>>) button next to the topic heading.
  • All metrics view entries maintain the familiar Quickview button to access context specific functions.
  • All metrics pages (apart from on the Primary storage pool page) operate in a hierarchy, making it easy to drill down through Zone > Cluster > Host > Instances > Volumes.




Collapsing the “Property” topic heading above allows for better view of the remaining columns:



Metrics view pages

Zone view


The Zone metrics view page gives an overview of resource usage in all zones in the CloudStack infrastructure. (Please note the screen shot above is using offline test data, hence does not represent a true view of resource usage and zone states). As described above this page will also flag any values which exceed the warning or disable threshold configured in global settings.

The Zone view page shows:

  • Zone name: clicking on the zone name will navigate to the Clusters view for that specific zone.
  • State: the enabled state of the zone.
  • Clusters: a count of (enabled clusters)/(total clusters) in the zone.
  • CPU usage:
    • Used: the average percentage of the CPU used metric for the clusters in the zone.
    • Deviation: the max CPU usage deviation of any cluster compared to the average across all clusters.
  • CPU allocation:
    • Allocated: the average CPU allocation percentage across all clusters in the zone.
    • Total: the total CPU allocation in GHz for the zone.
  • Memory usage:
    • Used: the average of the memory used metric per cluster in the zone.
    • Deviation: the max memory usage deviation of any cluster compared to the average across all clusters.
  • Memory allocation:
    • Allocated: average of the memory allocated metric across all clusters.
    • Total: the total memory allocation in GB across all clusters in the zone.
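One plausible way the “Used” average and “Deviation” figures could be derived from per-cluster data (made-up values; the exact formula CloudStack applies is an assumption here):

```python
# Hypothetical CPU-used percentages for the three clusters in a zone
cluster_cpu_used = [42.0, 48.0, 66.0]

# "Used" = average of the per-cluster figures
average = sum(cluster_cpu_used) / len(cluster_cpu_used)

# "Deviation" = max deviation of any cluster from that average
deviation = max(abs(u - average) for u in cluster_cpu_used)

print(average)    # 52.0
print(deviation)  # 14.0
```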


Cluster view


Clicking on any zone name in the zone metrics view will bring up the cluster metrics view. The same view can be found under the left hand Infrastructure menu > Clusters. This view provides similar statistics to the zone view – just in the context of each cluster and its hosts:

  • Cluster name: clicking on the cluster name will navigate to the host metrics view for the cluster.
  • State: the enabled state of the cluster.
  • Hosts: a count of (enabled hosts)/(total hosts) in the cluster.
  • CPU usage:
    • Used: the average percentage of the CPU used metric for the hosts in the cluster.
    • Deviation: the max CPU usage deviation of any host compared to the average across all hosts.
  • CPU allocation:
    • Allocated: the average CPU allocation percentage across all hosts in the cluster.
    • Total: the total CPU allocation in GHz for the hosts in the cluster.
  • Memory usage:
    • Used: the average of the memory used metric per host in the cluster.
    • Deviation: the max memory usage deviation of any host compared to the average across all hosts in the cluster.
  • Memory allocation:
    • Allocated: average of the memory allocated metric across all hosts.
    • Total: the total memory allocation in GB across all hosts in the cluster.


Host view


The host metrics view is accessed by drilling down from the clusters metrics view, alternatively from the left hand Infrastructure menu > Hosts > metrics button. (Again please note the screen shot above is using offline test data, hence does not represent a true view of resource usage and hosts states).

The host view presents:

  • Host name: clicking on the host name will navigate to the VM instance metrics view for the host.
  • State: the enabled state of the host.
  • Instances: a count of (running VM instances)/(total VM instances) on the host.
  • CPU usage:
    • Cores: total number of CPU cores on the compute host.
    • Total: total number of CPU resources (GHz) provided by the CPUs on the host. The figure in the bracket shows the CPU overprovisioning factor applied.
    • Used: the total of CPU resources (GHz) currently used.
    • Allocated: the total of CPU resources (GHz) allocated to the instances on the host.
  • Memory usage:
    • Total: the total memory for the host. The figure in brackets shows the memory overprovisioning factor.
    • Used: the total amount of memory used by the VM instances running on the host.
    • Allocated: the total memory allocated to all VM instances on the host.
  • Network usage:
    • Read: the cumulative network read bandwidth utilisation in GB.
    • Write: the cumulative network write bandwidth utilisation in GB.


VM instance view


The VM instance view is accessed by drilling down from the hosts view, alternatively via the left hand Instances menu.

The view provides the following data:

  • Name: the VM instance name. Clicking on the VM name will bring up the storage metrics view for the specific VM instance.
  • State: shows the running state of the VM.
  • IP address: the primary IP address for the VM instance.
  • Zone: the name of the zone the VM is running in.
  • CPU usage:
    • Cores: number of CPU cores allocated to the VM.
    • Total: total of CPU resources (GHz) allocated to the VM, i.e. (number of cores) x (CPU speed).
    • Used: current CPU usage (GHz) of the VM.
  • Memory usage:
    • Allocated: the total amount of memory allocated to the VM instance.
  • Network usage:
    • Read: the cumulative network read bandwidth utilisation by the VM (GB).
    • Write: the cumulative network write bandwidth utilisation by the VM (GB).
  • Disk usage: this view will only show data when the underlying storage system provides statistics.
    • Read: accumulated disk reads (MB) for the VM.
    • Write: accumulated disk writes (MB) for the VM.
    • IOPs: total number of IOPs (read + write) for the VM.


Volume usage view


The storage volume metrics view can be accessed from the left hand Storage menu, which will show all storage volumes managed by CloudStack, or alternatively by drilling down from the VM instances metrics view, which will show only the volumes for the specific VM.

This view shows:

  • Name: name of the storage volume (there is no further drill down from this page).
  • State: shows the attached state of the volume.
  • VM Name: the VM instance name the volume is attached to. This value is blank when the volume is detached.
  • Size: volume size (GB).
  • Type: VM root disk or a data disk.
  • Storage pool: lists the storage pool where the disk is stored.



 Primary storage view


The Primary Storage metrics view can only be accessed from the left hand Infrastructure menu > Primary storage > metrics button.

The view lists:

  • Name: primary storage pool name. Clicking the pool name will bring up the volume metrics view for all volumes stored on this primary storage pool.
  • Property:
    • State: up/down state of the primary storage pool.
    • Scope: lists the scope of the pool, i.e. either “cluster” or “zone”.
    • Type: storage type for the pool, e.g. NetworkFilesystem, VMFS, etc.
  • Disk:
    • Used: total GB used in the storage pool.
    • Total: the total capacity of the pool. The figure in brackets lists the storage overprovisioning factor.
    • Allocated: the current amount of allocated storage (GB) in the pool.
    • Unallocated: the current amount of unallocated storage (GB) in the pool, taking overprovisioning into account.
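As a sketch of how these capacity figures relate (made-up values; we assume, as stated above, that the unallocated figure takes the overprovisioning factor into account):

```python
# Hypothetical primary storage pool figures
total_gb = 1000.0                 # physical capacity of the pool
overprovisioning_factor = 2.0     # the figure shown in brackets
allocated_gb = 1200.0             # storage already allocated to volumes

# Effective capacity after overprovisioning
effective_capacity = total_gb * overprovisioning_factor

# Unallocated = effective capacity minus what is already allocated
unallocated_gb = effective_capacity - allocated_gb

print(effective_capacity)  # 2000.0
print(unallocated_gb)      # 800.0
```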



All in all, we hope the metrics view will be a really useful feature for anyone using CloudStack, especially those with larger CloudStack estates. As always, we’re happy to receive feedback, so please get in touch with any comments, questions or suggestions.


About The Author

Dag Sonstebo is a Cloud Architect at ShapeBlue, The Cloud Specialists. Dag spends most of his time designing and implementing IaaS solutions based on Apache CloudStack.


XenServer Native HA with CloudStack


Update – Following community feedback, Timeout Settings have now been added to the script

Update – The HA settings in this post also apply to XenServer 6.5.0 onwards

Warning: If you have applied Hotfix XS62ESP1004 to your XenServer 6.2 infrastructure and have not enabled Pool HA, in the event of your Pool Master going down, a Slave Host will NOT take over as Pool Master and you will lose connectivity via XenCenter. All Hosts will go into Alert state within CloudStack so you will have reduced functionality within your CloudStack Cloud. This article covers how to correctly configure Pool HA and how to manage your XenServers once HA is enabled.


Traditionally, when using Citrix XenServer with Apache CloudStack / Citrix CloudPlatform (simply referred to as CloudStack for the rest of this article), the XenServer HA feature was not enabled, as CloudStack took care of all HA events. However, the release of XS62ESP1004 changed a few things.

With the release of XS62ESP1004, the way XenServer handles the loss of its Pool Master has changed and rather than having to manually promote a Slave to become a Pool Master, the XenServer Pool can now ‘Self Heal’ which is great news, however you need to do some additional configuration for this magic to happen.

The important thing to understand is that CloudStack still takes care of all the VM HA events so we do not want to give XenServer control over the VMs. In fact all we are doing is enabling ‘Pool Master HA’ so that in the event of a failure of the Pool Master, a new Pool Master is elected.

Configuring Pool HA

The first step is to ensure all your Hosts in the Pool have XS62ESP1004 installed. If deploying a new batch of XenServers for use with CloudStack, it’s best practice to install all appropriate Hotfixes before adding the Hosts into CloudStack.

Once your Hosts are in a Pool and all networking, bonds etc. are configured, you need to add a dedicated Storage Repository which will be used only by the Pool Master HA mechanism. I generally name this ‘MGMT-HA’ as it’s handling the HA of the Pool Management elements. This can be an NFS or iSCSI mount and only needs to be 1 GB in size. The important thing is that it is used only for HA and is not added into CloudStack as Primary Storage.
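As a rough sketch, an NFS-backed SR for this purpose could be created from the CLI along the following lines (the NFS server address and export path are placeholders for your own environment):

```shell
# Create a small shared NFS SR dedicated to the Pool HA heartbeat.
# Replace <nfs-server> and <export-path> with your own NFS details.
xe sr-create name-label=MGMT-HA type=nfs content-type=user shared=true \
  device-config:server=<nfs-server> device-config:serverpath=<export-path>
```

Remember: this SR is for the HA heartbeat only and must not be added to CloudStack as Primary Storage.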

Once the SR is configured and online, the next step is to enable HA. This ‘could’ be enabled using the XenCenter UI, however there is a risk you will inadvertently enable VM HA, which could cause you all sorts of problems. In addition, you will probably be disabling and re-enabling HA on a regular basis, so putting together a simple script to enable / re-enable it when required is a much better option.

To allow for failover of network storage controllers, the HA timeout should be set to a minimum of 90 seconds to prevent premature self-fencing. If, after testing, this timeout is not long enough, increase the timeout and test again.

Add the following into a simple bash script and place it on your Hosts (you can run it from any Host, not just the Pool Master). Note how it references the SR name ‘MGMT-HA’, so if you use a different naming convention simply update the script to match your SR name.

#!/bin/bash
# Look up the UUID of the dedicated HA Storage Repository
MGMTHAUUID=$(xe sr-list name-label=MGMT-HA --minimal)
# Enable Pool HA using that SR for the heartbeat, with a 90 second timeout
xe pool-ha-enable heartbeat-sr-uuids=$MGMTHAUUID ha-config:timeout=90
echo "Pool HA is now enabled"


Now simply run the script and wait for the confirmation message “Pool HA is now enabled”. It can take a couple of minutes, so be patient.

When reviewing the settings within XenCenter, you could be forgiven for thinking that HA is not enabled, when in fact it is, but only for the Pool Master and not the VMs which are still protected by CloudStack.


Having Pool HA enabled unfortunately does add some extra complication to the management of your XenServer Pool so please read on to learn how to deal with failures and how to perform planned restarts etc.

Automatic Handling of Host Failures

If the Pool Master goes down a Slave will take over and all VMs which were running on the failed Pool Master will be automatically restarted on alternate Hosts, but only as long as their Compute Offering had the ‘Offer HA’ feature enabled.

CloudStack still takes care of restarting the VMs and will initiate the restart only after the new Pool Master has taken over and the timeout controlled by the global setting ‘alert.wait’ has expired. In the system used for the following tests, alert.wait was set to 60 seconds. The default value is ‘blank’, which results in a delay of 30 minutes once the Host has been detected as being down by CloudStack.
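For reference, global settings such as alert.wait can be changed via the UI or the updateConfiguration API. A hypothetical example using the CloudMonkey CLI (assuming it is installed and configured against your management server; depending on the setting and version, a management server restart may be required for the change to take effect):

```shell
# Set alert.wait to 60 seconds via the updateConfiguration API
cloudmonkey update configuration name=alert.wait value=60
```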

The following timings were observed on a production system which was undergoing testing. The system was running Apache CloudStack 4.3.2, the Hosts were Cisco UCS running XenServer 6.2 with all available updates applied (up to XS62ESP1014). Storage was provided by SolidFire and networking was 10Gb Cisco. Ten Windows 2008 R2 VMs were running on the system along with the usual array of System VMs. Each Cluster had 4 Hosts and the tests were performed on two Clusters at the same time. All VMs were running on the Pool Masters to cause maximum impact from the simulated failure. Timings are in minutes and seconds from the start of the test:

Start – Pool Master Failure simulated by killing power to Hosts via CIMC

01:15 – XenCenter detected host failures and lost connection

03:00 – XenCenter reconnected to new Pool Masters

03:15 – CloudStack confirmed failure of Hosts

03:30 – CloudStack initiated HA as VMs have now been down for over the configured 60 secs set by alert.wait

05:30 – All ten Windows VMs and System VMs were reported by CloudStack as ‘Running’

So in summary, following a failure of the XenServer Pool Masters in two Clusters, which is a worst case scenario, within 5 mins 30 secs all 10 Windows VMs, 4 Virtual Routers, and 3 System VMs were back online without any administrator input.

Recovering from Host Failures

Once the ‘failed’ Pool Masters are restored by simply powering them back on, they should automatically reconnect to the Pool.

However, if the Host fails to automatically reconnect and is not accessible via XenCenter, to re-enable it you need to disable HA on the Pool and also disable HA on the Host, which still thinks it is a Pool Master and will have gone into ‘Emergency Mode’. To do this, first disable HA on the Pool using XenCenter, then run the following command on the recovered Host:

xe host-emergency-ha-disable --force


After a short delay the Host will reconnect to the Pool as a Slave and then come back online within CloudStack.
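If XenCenter is unavailable, the same recovery can be sketched entirely from the CLI. Disabling Pool HA has a CLI equivalent, so the whole sequence might look like this:

```shell
# On any reachable Pool member: disable Pool HA (CLI alternative to XenCenter)
xe pool-ha-disable

# On the recovered ex-Master, which is stuck in Emergency Mode:
xe host-emergency-ha-disable --force
```

Once the Host has rejoined as a Slave, re-run the enable script to turn Pool HA back on.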

If a Slave Host goes down, HA on the Pool ‘may’ need to be disabled before the restored Host will re-connect to the Pool. There is no need to force-disable HA on the Host itself, as it should not go into Emergency Mode: the Pool Master will be accessible when the Host boots.

Once the failed Host is back online, Pool HA should be re-enabled by re-running the script which enables HA.

Managing XenServers when Pool HA is enabled

Occasionally you will need to restart or shut down a XenServer Host which is part of a CloudStack Cluster and belongs to a Pool which has Pool HA enabled. The following sections list the correct actions to take:

To perform a controlled restart of a Slave Host:

1. Place into Maintenance Mode within CloudStack
2. Restart the Host
3. Exit Maintenance Mode in CloudStack

To perform a controlled restart of a Pool Master Host:

1. Place into Maintenance Mode within CloudStack
2. Disable Pool HA
3. Place the Host into Maintenance Mode with XenCenter to force the promotion of a Slave to Pool Master
4. Restart the Host
5. Once Host has fully booted and is online in XenCenter re-enable Pool HA
6. Exit Maintenance Mode in XenCenter
7. Exit Maintenance Mode in CloudStack
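For environments managed without XenCenter, a rough CLI equivalent of the XenCenter steps above might look like the following (the UUIDs are placeholders; this is a sketch, not a drop-in script):

```shell
# Disable Pool HA before touching the Master
xe pool-ha-disable

# Hand the Master role to a chosen Slave
xe pool-designate-new-master host-uuid=<slave-uuid>

# Evacuate and reboot the old Master once it has become a Slave
xe host-disable uuid=<old-master-uuid>
xe host-evacuate uuid=<old-master-uuid>
xe host-reboot uuid=<old-master-uuid>
```

After the Host has fully rebooted, re-enable it with `xe host-enable` (or exit Maintenance Mode in XenCenter) and re-run the Pool HA enable script.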

To shut down a Host for an extended period of time (a Pool Master should first be demoted to a Slave):

1. Place into Maintenance Mode within CloudStack
2. Disable Pool HA
3. Place into Maintenance Mode with XenCenter
4. Shutdown the Host
5. Enable Pool HA
6. Exit Maintenance Mode in CloudStack

When you wish to bring the Host back online:

a. Disable Pool HA
b. Power on Host, then once Host has fully booted and is online in XenCenter re-enable Pool HA
c. Exit Maintenance Mode in XenCenter
d. Exit Maintenance Mode in CloudStack


After installing XS62ESP1004 you MUST enable HA on your XenServer Pools, and this results in a different approach to managing your XenServer resources.

About the Author

Geoff Higginbottom is CTO of ShapeBlue, the strategic cloud consultancy. Geoff spends most of his time designing private & public cloud infrastructures for telcos, ISPs and enterprises based on CloudStack.