
Introduction

CloudStack vSphere integration has not kept up with the evolution of vSphere itself, and several functions can be performed natively by vSphere much more efficiently than by CloudStack. vSphere also has additional features which would be beneficial to the operators of vSphere based CloudStack clouds.

This feature introduces support in CloudStack for VMFS6, vSAN, vVols and datastore clusters. Also, vSphere storage policies are now tied to compute and disk offerings, improving how offerings map to storage, and CloudStack allows inter-cluster VM and volume migrations, meaning that running VMs can now migrate along with all their volumes across clusters. Furthermore, storage operations (create and attach volume; create snapshot / template from volume) are improved in CloudStack by using the native APIs of vSphere.

Storage types and management concepts

CloudStack supports NFS and VMFS5 storage for primary storage, but vSphere has supported other storage technologies for some time now (VMFS6, vSAN, vVols and datastore clusters). vSphere also has 'vStorage APIs for Array Integration' (VAAI), which enables vSphere integration with other vendors' storage arrays on different storage technologies. Each storage technology is designed to serve a slightly different use case, but ultimately, they are all designed to improve the flexibility, efficiency, speed, and availability of storage to vSphere hosts. In addition to the storage types, there are storage management concepts in vSphere such as vSphere Storage Policies, which are not available in CloudStack.

Let us briefly go through these new technologies and concepts that are supported in CloudStack and vSphere.

VMFS6

VMFS6 is the latest VMware file system version (introduced with vSphere 6.5) and brings a few enhancements over VMFS5. The major differences are:

  • SESparse disks which provide improved space efficiency are now default disk types in VMFS6
  • Automatic space reclamation allows vSphere to reclaim dead or stranded space on thinly provisioned VMFS volumes in storage arrays

vSAN

vSAN was introduced with vSphere 5.5 and is a software-defined, enterprise storage solution that supports hyper-converged infrastructure (HCI) systems. vSAN is fully integrated with VMware vSphere, as a distributed layer of software within the ESXi hypervisor.

vVols

Virtual volumes (vVols), introduced with vSphere 6.0, is an integration and management framework for external storage providers, enabling a more efficient operational model optimized for virtualized environments and centred on the application instead of the infrastructure. vVols uniquely shares a common storage operational model with vSAN. Both solutions use storage policy-based management (SPBM) to eliminate storage provisioning, and use descriptive policies at the VM or VMDK level.

Datastore clusters

A datastore cluster is a collection of datastores with shared resources and a shared management interface. After a datastore cluster is created, vSphere Storage DRS can be used to manage storage resources. When a datastore is added to a datastore cluster, the datastore’s resources become part of the datastore cluster’s resources.

Storage policies

Storage policies have become vSphere's preferred method of determining the best placement of a disk image when differing 'qualities' of storage are available, and are simply a set of filters. For instance, a storage policy may say that the underlying disk must be encrypted; when a user specifies that storage policy, they are only returned a list of datastores with encrypted disks on which to place the VM's disk. Storage policies are effectively a prerequisite for the use of vSAN and vVols.

GUI or API support

CloudStack has new APIs, modifications to existing APIs and UI support so that vSphere's advanced capabilities can be used. To support the different storage types for primary storage, the UI has changed.

Storage types

Previously the only options for storage protocol type in CloudStack (while adding primary storage) were NFS, VMFS or custom. A new generic type called "presetup" has been added (for VMware) to add storage types VMFS5, VMFS6, vSAN or vVols. When a presetup datastore is added to CloudStack, the management server automatically identifies the storage pool type and saves it to the database.

To add a datastore cluster (which must have already been created on vCenter) as a primary storage, there is another new storage protocol type called “Datastore Clusters”.

To add one of the new Primary Storage types:

  1. Infrastructure tab -> Primary Storage -> Click “Add Primary Storage”
  2. Under “Protocol” the following options are available:
    • nfs
    • presetup
    • datastore cluster
    • custom

  3. When "PreSetup" is selected as the storage protocol type, specify the vCenter server, datacenter and datastore details.
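
For administrators who prefer the API, the same flow maps onto the existing createStoragePool call. The following cloudmonkey sketch is illustrative only – the UUIDs are placeholders, and the exact url scheme for a presetup datastore should be taken from the CloudStack installation documentation:

(cloudmonkey)> create storagepool zoneid=<zone-uuid> podid=<pod-uuid> clusterid=<cluster-uuid> \
    scope=cluster name=vsan-datastore-1 url=presetup://<vcenter>/<datacenter>/<datastore>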

Storage policies

  • New APIs are introduced to import and list already imported storage policies from vCenter.
  • Storage policies are imported automatically from vCenter when a VMware zone is added in CloudStack.
  • Storage policies are re-imported and synchronized with the CloudStack database whenever the "updateVmwareDc" API or the "importVsphereStoragePolicies" API is called. During re-import, any new storage policies added in vCenter are imported into CloudStack, and any storage policy deleted in vCenter is marked as removed in the CloudStack database.
  • Another new API, "listVsphereStoragePolicyCompatiblePools", is added to list the compatible storage pools for an imported storage policy.
The new APIs and their parameters:

  • importVsphereStoragePolicies – zoneid: the ID of the zone whose storage policies are to be imported from the corresponding vSphere
  • listVsphereStoragePolicies – zoneid: the ID of the zone whose storage policies are to be listed
  • listVsphereStoragePolicyCompatiblePools – zoneid: the ID of the zone in which to list storage pools compatible with the storage policy; policyid: the UUID of the storage policy

Existing APIs "createDiskOffering" and "createServiceOffering" are modified to bind a vSphere storage policy to the offerings using a new parameter "storagepolicy", which takes the policy UUID as input. In the GUI, while creating a service or disk offering, after selecting a specific VMware zone, the storage policies already imported in that zone are listed.

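For example, a cloudmonkey sketch of binding an imported policy to a disk offering – the offering name, size and policy UUID are placeholders, while "storagepolicy" is the parameter introduced by this feature:

(cloudmonkey)> create diskoffering name=gold-policy displaytext="Gold (policy bound)" disksize=100 storagepolicy=<imported-policy-uuid>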

  • When VMs are deployed in VMware hosts, a primary storage pool is selected which is compliant with the storage policy defined in the offerings. For data disks, the storage policy defined in the disk offering will be used, and for root disks, the storage policy defined in the service offering will be used.

Implementation

As mentioned above, CloudStack now supports adding the new storage types and datastore clusters under one protocol category called "presetup". The following are the steps the management server takes while adding a primary storage for the various storage protocols:

NFS storage

The management server mounts the NFS storage on the ESXi hosts by sending a create-NAS-datastore API call to vCenter.

PreSetup

  • Management server assumes that the provided datastore is already created on vCenter.
  • Management server checks access to the datastore using the name and vCenter details provided
  • Once a datastore with the provided name is found, the management server fetches the type of the datastore and adds the protocol type to the "storage_pool_details" table in the database

Datastore cluster

Since datastore cluster on vCenter is a collection of datastores, let us call the actual datastore cluster the parent datastore and the datastores inside the cluster as child datastores. CloudStack handles a datastore cluster by adding it as a single primary storage. The pools inside the cluster are hidden and won’t be available individually for any operation.

There were some implementation challenges in adding it directly as a primary storage. On vCenter, a datastore cluster looks similar to the other datastore types.

In the underlying vSphere implementation, the type of all datastores other than datastore clusters is "Datastore", whereas the type of a datastore cluster is "StoragePod". vSphere native APIs related to storage operations are applicable only to the "Datastore" type and not to "StoragePod". Due to this, the existing design of adding a datastore as a primary storage in CloudStack did not work for datastore clusters. The challenge was how CloudStack could abstract the datastore cluster as a single primary storage entity. This is achieved as follows:

  • When a datastore cluster is added as a primary storage in CloudStack, it auto imports the child datastores inside the cluster as primary storages in CloudStack; eg.: when datastore cluster DS1 with 2 child datastores is added into CloudStack, the management server will create 3 primary storages (1 parent datastore and 2 child datastores) and note the child datastore’s parent in database.
  • A new column “parent” is introduced in “storage_pool” table in database.
  • The "parent" column of child datastores points to the parent datastore (see the query sketch after this list).
  • Only the parent datastore is made visible to admins; child datastores are hidden, making the datastore cluster act like a black box.
  • Whenever a storage operation is performed on a datastore cluster, management server chooses one of the child datastores for that operation.
  • Any operation on a datastore cluster is in fact performed on all its child datastores. For example, if a datastore cluster is put into maintenance mode, then all the child datastores are put into maintenance mode; upon any failure, the state is reverted and the original operation fails with an error.
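
As a hedged illustration of that parent/child bookkeeping, the "storage_pool" table can be queried directly (the filter on removed pools follows the usual CloudStack table convention):

mysql> SELECT id, name, parent FROM cloud.storage_pool WHERE removed IS NULL;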

Following are the APIs where datastore cluster implementation is involved (storageid is passed as a parameter):

  • updateConfiguration – propagates the value of the global setting passed for the datastore cluster to all its child datastores
  • listSystemVms – lists all system VMs located in all child datastores
  • prepareTemplate – prepares templates in one of the available child datastores
  • listVirtualMachines – lists all virtual machines located in all child datastores
  • migrateVirtualMachine – migrates a VM to one of the available child datastores
  • migrateVolume – migrates a volume to one of the available child datastores

Note: on a datastore cluster which has already been added as a primary storage, if any storage pool needs to be added to or removed from the datastore cluster, the primary storage also has to be removed from CloudStack and re-added after the required modifications on the storage (this has been resolved in CloudStack 4.15.1).

Storage Policies

On vCenter, storage policies act like a filter and control which type of storage is provided for the virtual machine, and how the virtual machine is placed within storage. So the best fit for storage policies in CloudStack is in disk offering and compute offering, since these offerings are also used to find the suitable storage and resources during virtual machine deployment.

 An admin can select an imported storage policy while creating a disk or service offering. Based on the storage policy, the corresponding disk is placed in the relevant storage pool which is in compliance with the storage policy, and the VM and disk are configured to enforce the required level of service based on the policy.

For example:

  • If a compute offering is created with "VVol No Requirement Policy" (the default storage policy for vVols), CloudStack tries to keep the root disk of the virtual machine in vVols primary storage, and the VM is also configured with that policy. Upon any other storage operation (i.e. volume migration), this storage policy will be taken into consideration for the best placement of the VM and root disk.
  • If a disk offering is created with any storage policy, the same applies to the data disk.

vSphere related changes

“fcd” named folder in the root directory of storage pool

  • Previously any data disk was placed in the root folder of the primary storage pool. This is possible for NFS or VMFS5 storage types, but there is a limitation with the vSAN storage type as it does not support storage of user files directly in the root of the directory structure. Therefore, a separate folder is now created on all primary storage pools with the name “fcd”.
  • Since the storage operations are made independent of storage type, the “fcd” folder is created on all storage types.
  • The folder is named "fcd" because when the vSphere API is used to create a first class disk, vCenter automatically creates a folder called 'fcd' (unless it already exists) and creates the disk in that folder.

vVols template or VM creation with UUID as name

  • When deploying a VM from an OVF template on vCenter, a UUID cannot be used as the name of the VM or template. CloudStack seeds templates from secondary to primary storage using the template UUID, and uses a newly generated UUID when creating a worker VM. So for vVols datastore VM or template creation operations, the string "cloud.uuid-" is prepended to the UUID whenever it is used (e.g. cloud.uuid-<uuid>).

vVols disk movement

vVols does not allow a disk to be moved from where it was created or placed using vSphere native APIs. If a disk is moved from its intended location, the pointer to the underlying vVols storage is lost and the disk becomes inaccessible. Therefore, the following changes were made with respect to vVols storage pools to avoid disk movements:

VM creation:

The VM is cloned from the template on the vVols datastore directly with CloudStack's VM internal name (e.g. i-2-43-VM). Previously, CloudStack cloned the VM from the template with the root disk name (e.g. ROOT-43) and moved the volumes from the root-disk-named folder to the VM internal name folder.

Volume creation:

  • When a volume is first created and placed in a folder on the storage, it will not be moved from that folder whether it is attached to a VM or detached from a VM.

Conclusion

As of LTS version 4.15, CloudStack supports vSAN, vVols, VMFS5, VMFS6, NFS, datastore clusters and storage policies, and also operates more like vSphere does natively to better manage them.

Vendors of virtual appliances (vApp) for VMware often produce 'templates' of their appliances in an OVA format. An OVA file will contain disk images, configuration data of the virtual appliance, and sometimes a EULA which must be acknowledged.

The purpose of this feature is to enable CloudStack to mimic the end-user experience of importing such an OVA directly into vCenter, the end result being a virtual appliance deployed with the same configuration data in the virtual machine's descriptor (VMX) file as would be there if the appliance had been deployed directly through vCenter.

The OVA will contain configuration data regarding both hardware parameters required to start a virtual appliance, and software parameters which the virtual appliance will be able to read during instantiation. Generally, the software parameters will take the form of questions posed to the end-user, the answer to which is passed to the virtual appliance. Hardware parameters may either be set as a result of a question to the end-user (i.e. "Would you like a small, medium, or large appliance?"), or they may be passed directly into the virtual machine's descriptor (VMX) file.

CloudStack version 4.15 includes full support for vApp OVA templates. Users will be able to deploy vApps in CloudStack, resulting in an appliance deployed with the same configuration data as if the VM had been deployed directly through vCenter.

Glossary:

The following terms will be used throughout this blog:

  • Workflow: The VM deployment cycle procedure / tasks on CloudStack for VMware environments.
  • ‘Deploy-as-is’: A new workflow / paradigm in which the deployed VMs inherit all the pre-set configurations and information from the template. In other words, ‘deploy-as-is VMs’ are clones of the templates from which they are deployed.
  • ‘Non-deploy-as-is’: The usual workflow on CloudStack for VMware environments prior to version 4.15.

High-level CloudStack VMware workflow refactor:

In this section, we will deep dive on the improvements and refactor in CloudStack to support vApps – from the usual VMware workflow to the ‘deploy-as-is’ workflow.

The default behavior for templates registered from CloudStack 4.15 and onwards will be the ‘deploy-as-is’ workflow. The main difference between this and the existing workflow is that ‘deploy-as-is’ lets CloudStack simplify the VM deployment process on VMware by using the template information about guest OS, hardware version, disks, network adapters and disk controllers to generate new VMs, without much user intervention.

As a vApp template can have multiple ‘configurations’, the copy from secondary storage to primary storage must be extended:

  • The ‘configuration’ ID selected by the user must be considered when copying a template from secondary to primary storage
  • The same template can now have multiple “versions” in the same primary storage, as the users can select from multiple OVF ‘configurations’ from the same template.
  • This is reflected in a new column 'deployment_option' in the table 'template_spool_ref', as sketched in the query below
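
A hedged look at that column (the template ID is a placeholder; other columns in the table vary by version):

mysql> SELECT template_id, pool_id, deployment_option FROM cloud.template_spool_ref WHERE template_id = <template-id>;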

Prior to version 4.15, these were the steps involved in VMware VM deployment:

  1. Deploy the OVF template from Secondary to Primary storage into a 'template VM'
  2. Create the VM ROOT disk:
    • Clone the template VM on Primary storage into a temporary VM
    • Detach the disk
    • Remove the temporary VM
  3. Deploy a VM from a template:
    • Create a blank VM
    • Attach the ROOT disk created previously and any data disk.

As mentioned earlier, all templates registered from CloudStack 4.15 will use the improved workflow (‘deploy-as-is’). This extends but simplifies the previous workflow:

  • VMs can have multiple ROOT disks (as appliances may need multiple disks to work)
  • All the information set in the template is honored, and any user-specified settings such as guest OS type, ROOT disk controller, network adapter type, boot type and boot mode are ignored.
  • OVF template deployed from Secondary to Primary storage matching the configuration ID selected by the user, creating a ‘template VM’ of a specific OVF ‘configuration’ on the Primary Storage
  • Create the VM ROOT disks:
    • Clone the template VM on Primary storage into a final user VM
    • Deploy a VM from a template (the user VM exists by now):
      • Use the cloned VM to get its disk info, and then reconcile the disk information in the database
      • The VM's disk information is obtained by the SSVM when CloudStack allocates volumes for virtual machines in the database. The SSVM uses the template ID and the selected 'configuration' to read the OVF file in secondary storage and retrieve the disk information.
  • Attach any data disk.

When it comes to the original workflow (for templates registered before 4.15), note that the use of a blank VM means dropping all the information from the source template. As the resulting VM was not a clone of the template (except for the ROOT disk), all the information that the template contained was not considered: guest OS type, hardware version, controllers, configurations, vApp properties, etc. With the new workflow we copy all the information available from the template.

As the information is now obtained from the template itself, CloudStack no longer requires some information at the template registration time. Instead, it obtains this information directly from the OVF descriptor file, meaning there is no need to select a ROOT disk controller, network adapter type or guest OS.

Initially, the template is registered with a default guest OS ‘OVF Configured OS’ until the template is successfully installed. Once installed, the guest OS is displayed from the information set in the OVF file.

  • To provide a complete list of supported guest OS, the tables ‘guest_os’ and ‘guest_os_hypervisor’ have been populated with the information from: https://code.vmware.com/apis/704/vsphere/vim.vm.GuestOsDescriptor.GuestOsIdentifier.html
  • There is also no need to select BIOS or UEFI (or Legacy vs. Secure boot mode) at VM deployment time, and no 'hardcoding' of the VM HW / VMX version to the 'latest supported' by the destination ESXi host (the old workflow behaviour).
  • This information is obtained from the OVF file at template registration – i.e. after the template is downloaded to the secondary storage, it is extracted and the OVF file is parsed.
  • All the parsed information is sent to the management server and persisted in the database (in a new table 'template_deploy_as_is_details'), as sketched below
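
A hedged way to inspect what was parsed, assuming the usual CloudStack details-table layout of (template_id, name, value):

mysql> SELECT name, value FROM cloud.template_deploy_as_is_details WHERE template_id = <template-id>;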

Disclaimer

Some of this functionality was introduced in version 4.14. However, as the only supported sections were the user-configurable properties, this support was extremely limited and should not be relied upon for vApp templates. Templates registered prior to upgrading to or installing CloudStack version 4.15 will not support the vApp functionality. These templates will continue working as they were before the upgrade, following the usual workflow for VMware deployments on CloudStack.

The default behavior for templates registered from version 4.15 and onwards is the “deploy-as-is” workflow.

vApp templates format

An appliance OVA contains a descriptor file (OVF file) containing information about the appliance, organized in different sections in the OVF descriptor file. Most sections used by the appliances are not set on ‘non-deploy-as-is’ templates.

The most common sections set on appliances are:

  • Virtual hardware and configurations. The appliance can provide different deployment options (configurations) where each one of them has different hardware requirements such as CPU number, CPU speed, memory, storage, networking.
  • User-configurable parameters. The appliance can provide a certain number of user-configurable properties (also known as vApp properties) which are often required for the initial configuration of an appliance. It is possible to define required parameters which must be set by the user to continue with the appliance deployment.
  • The license agreements. The appliance can define a certain number of license agreements which must be accepted by the user to continue with the appliance deployment.

For further information on the full OVF format and its syntax, please visit the OVF Specification Document: https://www.dmtf.org/sites/default/files/standards/documents/DSP0243_2.1.1.pdf

vCenter – Deploy a vApp from OVF

Before this feature, the only feasible way to deploy an appliance was directly through vCenter, using the 'Deploy from OVF' operation. However, CloudStack would not be aware of this VM, and it would therefore not be managed by CloudStack. As this feature enables users to mimic the deployment experience of appliances through CloudStack, we will briefly describe the vCenter deployment experience with an example appliance containing all the sections described in the previous section (vApp templates format). Before starting the deployment of a virtual appliance, vCenter first displays the end-user license agreements, which the user must accept before continuing with the deployment wizard:

The next step displays the different configurations for the appliance and their respective hardware requirements. The user must select only one of the configurations available to proceed to the next step:

The appliances are preset with a certain number of network interfaces, which must be connected to networks (either to different networks or to the same network). Each network interface shows a name to help the user connect the interfaces to the appropriate networks.

The last step involves setting some properties. The property input fields can be of different types: text fields, checkboxes or dropdown menus. vCenter displays the properties and allows the user to enter the desired values:

vCenter is now ready to deploy the appliance.

CloudStack – Deploy a vApp

The VM deployment wizard in CloudStack is enhanced to support deploying vApps in version 4.15. This is achieved by examining the OVF descriptor file of the ‘deploy-as-is’ templates and presenting the information that needs user input as new sections in the VM deployment wizard.

New VM deployment sections

With the new ‘deploy-as-is’ workflow, when CloudStack detects that a template contains the special sections described above, it presents them to the user in a similar way to vCenter, but in a different order, extending the existing VM deployment wizard steps.

The VM deployment wizard requires the user to select a template from which the VM must be deployed. If the selected template provides different OVF ‘configurations’, then the existing ‘Compute Offering’ step is extended. Instead of displaying the existing compute offerings to the user, CloudStack now displays a new dropdown menu showing all available configurations. The user must select one and a compatible compute offering:

When the user selects a configuration from the Configuration menu, then the list of compute offerings is filtered, displaying only the service offerings (fixed or custom) matching the minimum hardware requirements defined by the selected configuration.

In the case of custom offerings, CloudStack automatically populates the required values with all the information available from the configuration (for number of CPUs, CPU speed and memory). If CloudStack does not find information for some of these fields, then the user must provide a value.

The ‘Networks’ step is also extended, displaying all the network interfaces required by the appliance.
If the template contains user-configurable properties, then a new section ‘vApp properties’ is displayed:

If the template contains end-user license agreements, then a new section ‘License agreements’ is displayed, and the user must accept the license agreements to finish the deployment.

Conclusion

CloudStack version 4.15 introduces the deploy-as-is workflow for VMware environments, making it the default workflow for every new template. The previous workflow is still preserved, but only for templates registered prior to version 4.15.

In CloudStack, secondary storage pools (image stores) house resources such as volumes, snapshots and templates. Over time these storage pools may have to be decommissioned or data moved from one storage pool to another, but CloudStack isn’t too evolved when it comes to managing secondary storage pools.

This feature improves CloudStack’s management of secondary storage by introducing the following functionality:

  • Balanced / Complete migration of data objects among secondary storage pools
  • Enable setting image stores to read-only (making further operations such as download of templates or storage of snapshots and volumes impossible)
  • Algorithm to automatically balance image stores
  • View download progress of templates across datastores using the ‘listTemplates’ API

Balanced / Complete migration of data objects among secondary storage pools

To enable admins to migrate data objects (i.e. snapshots, private templates or volumes) between secondary storage pools, an API has been exposed which supports two types of migration:

  • Balanced migration – achieved by setting ‘migrationtype’ field of the API to “Balance”
  • Complete migration – achieved by setting the 'migrationtype' field of the API to "Complete"

If the migration type isn’t provided by the user, it will default to “Complete”.

Usage:

migrate secondarystoragedata srcpool=<src image store uuid> destpools=<array of destination image store uuids> migrationtype=<balance/complete>

Balanced migration:

The idea here is to evenly distribute data objects among the specified secondary storage pools. For example, if a new secondary storage is added and we want data to be placed in it from another image store, the “Balanced” migration policy would be most suitable.

As part of this policy there is a global setting, "image.store.imbalance.threshold", which helps decide when the stores in question have been balanced. This threshold (set to 0.3 by default) indicates the ideal mean standard deviation of the image stores' usage. If the mean standard deviation is above this threshold, migration of the selected data object will proceed. However, if the mean standard deviation of the image stores (destination(s) and source) is less than or equal to the threshold, then the image stores have reached a balanced point and migration can stop. As part of the balancing algorithm, we also check the mean standard deviation of the system before and after the migration of a specific file; if the standard deviation would increase, we omit that particular file and proceed further, as its migration would not provide any benefit.
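
The threshold can be tuned via the standard updateConfiguration API. A hedged example – the value is chosen purely for illustration, and a higher value makes CloudStack treat the stores as balanced sooner:

(cloudmonkey)> update configuration name=image.store.imbalance.threshold value=0.5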

Complete migration:

Complete migration migrates a file if the destination image store has sufficient free capacity to accommodate the data object (used capacity is below 90% and free space is larger than the size of the file chosen). Also, during complete migration the source image store is set to "read-only", to ensure that the store is no longer selected during any other operation involving storage of data in image stores. Before a migration job is admitted, the following validations are performed:

  • Source and destination image stores are valid (i.e. NFS-based stores in the same datacenter)
  • Validity of the migration type / policy passed
  • Role of the secondary storage(s) is “Image”
  • Destination image stores don’t include the source image store
  • None of the destination image stores should be set to read-only
  • There can be only one migration job running in the system at any given time
  • If choice of migration is “Complete” then there should not be any files that are in Creating, Copying or Migrating states

Furthermore, care has been taken to ensure that snapshots belonging to a chain are migrated to the same image store. If snapshots are created during migration, then:

  • If the migration policy is "complete" and the snapshot has no parent, then it will be migrated to one of the destination image stores
  • If the snapshot has a parent then the snapshot will be moved to the same image store as the parent

Another aspect of the migration feature is scaling of Secondary storage VMs (SSVMs) to prevent all migrate jobs being handled by one SSVM, which may hamper the performance of other activities that are scheduled to take place on it. The relevant global settings are:

  • max.migrate.sessions. New. Indicates the number of concurrent file transfer operations that can take place on an SSVM (defaults to 2)
  • ssvm.count. New. Maximum number of additional SSVMs that can be spawned to handle the load (defaults to 5). However, if the number of hosts in the datacenter is less than the max count set, then the number of hosts takes precedence
  • vm.auto.reserve.capacity. Existing. Should be set to true (default) if we want SSVMs to scale when the load increases

Additional SSVMs will be created when half of the total number of jobs have been running for more than the duration defined by the global setting max.data.migration.wait.time (default 15 minutes). Therefore, if a migrate job has been running for more than 15 minutes, a new SSVM is spawned and jobs can be scheduled on it.

These additional SSVMs will be automatically destroyed when:

  • The migration job has reached completion
  • The total number of commands in the pipeline (as determined by the cloud.cmd_exec_log table) is below the defined threshold
  • There are no jobs running on the SSVM in question

UI Support:

As well as using the API (cloudmonkey / cmk), support has been added in the new Primate UI.

Navigate to: Infrastructure → Secondary Storages. At the top right corner, click on the migrate button.

Enable setting image stores to read-only

A secondary storage pool may need to be set to read-only mode, to prevent downloading objects onto it. This could prove useful when decommissioning a storage pool. An API has been defined to enable setting a Secondary storage to read only:

update imagestore id=<image_store_id> readonly=<true/false>

It is possible to filter out image stores based on read-only / read-write permissions using the API 'listImageStores'.
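
For example, a hedged cloudmonkey sketch (assuming the filter parameter is named 'readonly'):

(cloudmonkey)> list imagestores readonly=true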

Algorithm to automatically balance image stores

Currently the default behaviour of CloudStack is to choose an image store with the highest free capacity. There is a new global setting: “image.store.allocation.algorithm”, which by default is set to “firstfitleastconsumed”, meaning that it returns the image stores in decreasing order of their free capacity. Another allocation option is ‘random’, which returns image stores in a random order.

View download progress of templates across datastores using the ‘listTemplates’ API

The “listTemplates” API has been extended to support viewing download details: progress, download status, and image store. For example:
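
A hedged cloudmonkey sketch – the template UUID is a placeholder, and the download details (progress, state and image store) appear as part of each template in the response:

(cloudmonkey)> list templates templatefilter=self id=<template-uuid>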

This feature will be available as of Apache CloudStack 4.15, which will be an LTS release.

For a while, the CloudStack community has been working on adding support for containers. ShapeBlue successfully implemented the CloudStack Container Service and donated it to the project in 2016, but it was not completely integrated into the codebase. However, with the recent CloudStack 4.14 LTS release, the CloudStack Kubernetes Service (CKS) plugin adds full Kubernetes integration to CloudStack – allowing users to run containerized services using Kubernetes clusters through CloudStack.

CKS adds several new APIs (and updates to the UI) to provision Kubernetes clusters with minimal configuration by the user. It also provides the ability to add and manage different Kubernetes versions, meaning not only deploying clusters with a chosen version, but also providing the option to upgrade an existing cluster to a newer version.

The integration

CKS leverages CloudStack’s plugin framework and is disabled by default (for a fresh install or upgrade) – enabled using a global setting. It also adds global settings to set the template for a Kubernetes cluster node virtual machine for different hypervisors; to set the default network offering for a new network for a Kubernetes cluster; and to set different timeout values for the lifecycle operations of a Kubernetes cluster:

  • cloud.kubernetes.service.enabled – Indicates whether the CKS plugin is enabled or not. Management server restart needed.
  • cloud.kubernetes.cluster.template.name.hyperv – Name of the template to be used for creating Kubernetes cluster nodes on HyperV
  • cloud.kubernetes.cluster.template.name.kvm – Name of the template to be used for creating Kubernetes cluster nodes on KVM
  • cloud.kubernetes.cluster.template.name.vmware – Name of the template to be used for creating Kubernetes cluster nodes on VMware
  • cloud.kubernetes.cluster.template.name.xenserver – Name of the template to be used for creating Kubernetes cluster nodes on XenServer
  • cloud.kubernetes.cluster.network.offering – Name of the network offering that will be used to create the isolated network in which Kubernetes cluster VMs will be launched
  • cloud.kubernetes.cluster.start.timeout – Timeout interval (in seconds) in which the start operation for a Kubernetes cluster should be completed
  • cloud.kubernetes.cluster.scale.timeout – Timeout interval (in seconds) in which the scale operation for a Kubernetes cluster should be completed
  • cloud.kubernetes.cluster.upgrade.timeout – Timeout interval (in seconds) in which the upgrade operation for a Kubernetes cluster should be completed*
  • cloud.kubernetes.cluster.experimental.features.enabled – Indicates whether experimental features for Kubernetes clusters, such as Docker private registry, are enabled or not

* There can be some variation in obeying cloud.kubernetes.cluster.upgrade.timeout, as the upgrade on a cluster node must finish either successfully or as a failure before CloudStack can report the status of the cluster upgrade.
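
As a quick sketch, the plugin can be enabled with the standard updateConfiguration API, followed by the management server restart noted above:

(cloudmonkey)> update configuration name=cloud.kubernetes.service.enabled value=true
# systemctl restart cloudstack-management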

Once the initial configuration is complete and the plugin is enabled, the UI starts showing a new tab ‘Kubernetes Service’ and different APIs become accessible:

Under the hood

Provisioning a Kubernetes cluster in itself can be a complex process depending on the tool used (minikube, kubeadm, kubespray, etc.). CKS simplifies and automates the complete process, using the kubeadm tool for provisioning clusters and performing lifecycle operations. As mentioned in the kubeadm documentation:

Kubeadm performs the actions necessary to get a minimum viable cluster up and running. By design, it cares only about bootstrapping, not about provisioning machines. Likewise, installing various nice-to-have addons, like the Kubernetes Dashboard, monitoring solutions, and cloud-specific addons, is not in scope.

Therefore, all orchestration for cluster node virtual machines is taken care of by CloudStack, and it is only CloudStack that decides the host or storage for the node virtual machines. CKS uses the kubectl tool for communicating with the Kubernetes cluster to query its state, active nodes, version, etc. Kubectl is a command-line tool for controlling Kubernetes clusters.

For node virtual machines, CKS requires a CoreOS based template. CoreOS has been chosen as it provides docker installation and the networking rules needed for Kubernetes. Considering the current CoreOS situation, support for a different host OS could be added in the future.

Networking for the Kubernetes cluster is provisioned using the Weave Net CNI provider plugin.

The prerequisites

To successfully provision a Kubernetes cluster using CKS there are a few prerequisites and conditions that must be met:

  1. The template registered for a node virtual machine must be a public template.
  2. Currently supported Kubernetes versions are 1.11.x to 1.16.x. At present, v1.17 and above might not work due to their incompatibility with weave-net plugin.
  3. A multi-master, HA cluster can be created using Kubernetes versions 1.16.x only.
  4. While creating a multi-master, HA cluster over a shared network, an external load-balancer must be manually set up. This load-balancer should have port-forwarding rules for SSH and Kubernetes API server access. CKS assumes SSH access to cluster nodes is available on ports 2222 to (2222 + cluster node count - 1). Similarly, for API access, port 6443 must be forwarded to the master nodes. On a CloudStack isolated network, these rules are provisioned automatically.
  5. Currently only a CloudStack isolated or shared network can be used for deployment of a Kubernetes cluster. The network must have the Userdata service enabled.
  6. For CoreOS, a minimum of 2 CPU cores and 2GB of RAM is required for deployment of a virtual machine. Therefore, a suitable service offering must be created and used while deploying a Kubernetes cluster.
  7. Node virtual machines must have Internet access at the time of cluster provisioning, scale and upgrade operations, as kubeadm cannot perform certain cluster provisioning steps without it.

The flow

After completing the initial configuration and confirming the requirements, an administrator can proceed with adding a supported Kubernetes version and deploying a Kubernetes cluster. The addition and management of Kubernetes versions can only be done by an administrator; other users only have permission to list supported versions. Each Kubernetes version in CKS can only be added as an ISO – a binaries ISO which contains all the Kubernetes binaries and Docker images for the given Kubernetes release. Using an ISO with the required binaries allows faster installation of Kubernetes on the node virtual machines; kubeadm needs active Internet access on the master nodes during cluster provisioning, and an ISO with binaries and Docker images avoids downloading them from the Internet. To facilitate the creation of an ISO for a given Kubernetes release, a new script named create-kubernetes-binaries-iso.sh has been added to the cloudstack-common packages. More about this script can be found in the CloudStack documentation.

Add Kubernetes cluster form in CloudStack UI:

Once there is at least one enabled and ready Kubernetes version and the node VM template in place, CKS will be ready to deploy Kubernetes clusters, which can be created using either the UI or API. Several parameters such as Kubernetes version, compute offering, network, size, HA support, node VM root disk size, etc. can be configured while creating the cluster.

Different operations can be performed on a successfully created Kubernetes cluster, such as start/stop, retrieval of the cluster kubeconfig, scale, upgrade or destroy. Both the UI and API provide the means to do that, as sketched below.
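
A hedged cloudmonkey sketch of that lifecycle – the UUIDs are placeholders, and the parameter names follow the APIs listed later in this post:

(cloudmonkey)> create kubernetescluster name=demo-cluster zoneid=<zone-uuid> \
    kubernetesversionid=<version-uuid> serviceofferingid=<offering-uuid> size=2
(cloudmonkey)> get kubernetesclusterconfig id=<cluster-uuid>
(cloudmonkey)> scale kubernetescluster id=<cluster-uuid> size=3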

Kubernetes cluster details tab in CloudStack UI:

Once a Kubernetes cluster has been successfully provisioned, CKS deploys the Kubernetes Dashboard UI. A user can download the cluster's kubeconfig and use it to access the cluster locally, or to deploy services on the cluster. Alternatively, the kubectl tool can be used along with the kubeconfig file to access the Kubernetes cluster via the command line. Instructions for both kubectl and Kubernetes Dashboard access are available on the Kubernetes cluster details page in CloudStack.

Kubernetes Dashboard UI accessible for Kubernetes clusters deployed with CKS:

The new APIs

CKS adds a number of new APIs for performing different operations on Kubernetes supported versions and Kubernetes clusters:

Kubernetes version related APIs:

  • addKubernetesSupportedVersion – Available only to admins; adds a new supported Kubernetes version
  • deleteKubernetesSupportedVersion – Available only to admins; deletes an existing supported Kubernetes version
  • updateKubernetesSupportedVersion – Available only to admins; updates an existing supported Kubernetes version
  • listKubernetesSupportedVersions – Lists supported Kubernetes versions

Kubernetes cluster related APIs:

  • createKubernetesCluster – For creating a Kubernetes cluster
  • startKubernetesCluster – For starting a stopped Kubernetes cluster
  • stopKubernetesCluster – For stopping a running Kubernetes cluster
  • deleteKubernetesCluster – For deleting a Kubernetes cluster
  • getKubernetesClusterConfig – For retrieving the Kubernetes cluster config
  • scaleKubernetesCluster – For scaling a created, running or stopped Kubernetes cluster
  • upgradeKubernetesCluster – For upgrading a running Kubernetes cluster
  • listKubernetesClusters – For listing Kubernetes clusters

The CloudStack Kubernetes Service adds a new dimension to CloudStack, allowing cloud operators to provide their users with Kubernetes offerings, but this is just the beginning! There are already ideas for improvements within the community, such as support for different CloudStack zone types, support for VPC networks, and use of a Debian-based or user-defined host OS template for node virtual machines. If you have an improvement to suggest, please log it in the CloudStack GitHub project.

More details about CloudStack Kubernetes Service can be found in the CloudStack documentation.

About the author

Abhishek Kumar is a Software Engineer at ShapeBlue, the Cloud Specialists. Apart from spending most of his time implementing new features and fixing bugs in Apache CloudStack, he likes reading about technology and politics. Outside work he spends most of his time with family and tries to work out regularly.

There is currently significant effort going on in the Apache CloudStack community to develop a new, modern UI (user interface) for CloudStack: Project Primate. In this article, I discuss why this new UI is required, the history of this project and how it will be included in future CloudStack releases.

There are a number of key dates that current users of CloudStack should take note of and plan for, which are listed towards the end of this article.

We also recently held a webinar on this subject.

The current CloudStack UI

The current UI for Apache CloudStack was developed in 2012/13 as a single browser page UI "handcrafted" in JavaScript. Despite becoming the familiar face of CloudStack, the UI has always had limitations, such as no browser history, poor rendering on tablets / phones and loss of context on refresh. Its look and feel, although good when it was created, has become dated. However, by far the biggest issue with the existing UI is that its 90,000 lines of code have become very difficult to maintain and extend for new CloudStack functionality. This has resulted in some new CloudStack functionality being developed as API-only, and a disproportionate amount of effort being required to develop new UI functionality.

How to build a new UI for CloudStack?

A UI R&D project was undertaken by Rohit Yadav in early 2019. Rohit is the creator and maintainer of CloudMonkey (CloudStack CLI tool) and he set off to use the lessons he’d learnt creating CloudMonkey to evaluate the different options for creating a new UI for CloudStack.

Rohit’s initial R&D work identified a set of overall UI requirements and also a set of design principles.

UI Requirements:

  • Clean enterprise admin & user UI
  • Intuitive to use
  • To match existing CloudStack UI functionality and features
  • Separate UI code from core Management server code so the UI becomes a client to the CloudStack API
  • API auto-discovery of new CloudStack functionality
  • Config and role-based rendering of buttons, actions, views etc.
  • Dashboard, list and detail views
  • URL router and browser history driven
  • Local-storage based notification and polling
  • Dynamic language translations
  • Support desktop, tablet and mobile screen form factors

Design principles:

  • Declarative programming and web-component based
  • API discovery and param-completion like CloudMonkey
  • Auto-generated UI widgets, views, behaviour
  • Data-driven behaviour and views, buttons, actions etc. based on role-based permissions
  • Easy to learn, develop, customise, extend and maintain
  • Use modern development methodologies, frameworks and tooling
  • No DIY frameworks, reuse opensource project(s)

A number of different JavaScript frameworks were evaluated for implementation, with Vue.JS being chosen due to the speed and ease that it could be harnessed to create a modern UI. Ant Design was also chosen as it gave off-the-shelf, enterprise-class, UI building blocks and components.

Project Primate

VM Instance details in Primate

Out of these initial principles came the first iteration of Project Primate, a new Vue-based UI for Apache CloudStack. Rohit presented his first cut of Primate at the CloudStack Collaboration Conference in Las Vegas in September 2019, to much excitement and enthusiasm from the community.

Unlike the old UI, Primate is not part of the core CloudStack Management server code, giving a much more modular and flexible approach. This allows Primate to be "pointed" at any CloudStack API endpoint, or even multiple versions of the UI to be used concurrently. The API auto-discovery allows Primate to recognise new functionality in the CloudStack API, much like CloudMonkey currently does.

Primate is designed to work across all browsers, tablets and phones. From a developer perspective, the codebase should be about a quarter that of the old UI and, most importantly, the Vue.JS framework is far easier for developers to work with.

Adoption of Project Primate by Apache CloudStack

Primate is now being developed by CloudStack community members in a Special Interest Group (SIG). Members of that group include developers from EWERK, PCExtreme, IndiQus, SwissTXT and ShapeBlue.

In late October, the CloudStack community voted to adopt Project Primate as the new UI for Apache CloudStack and deprecate the old UI. The code was donated to the Apache Software Foundation and the following plan for replacement of the old UI was agreed:

Technical preview – Winter 2019 LTS release

A technical preview of the new UI will be included with the Winter 2019 LTS release of CloudStack (targeted to be in Q1 2020 and based on the 4.14 release of CloudStack). The technical preview will have feature parity with the existing UI. The release will still ship with the existing UI for production use, but CloudStack users will be able to deploy the new UI in parallel for testing and familiarisation purposes. The release will also include a formal advance deprecation notice of the existing UI.

At this stage, the CloudStack community will also stop taking feature requests for new functionality in the existing UI. Any new feature development in CloudStack will be based on the new UI. In parallel to this, work will be done on the UI upgrade path and documentation.

General Availability  – Summer 2020 LTS release

The summer 2020 LTS release of CloudStack will ship with the production release of the new UI. It will also be the last version of CloudStack to ship with the old UI. This release will also have the final deprecation notice for the old UI.

Old UI deprecated – Winter 2020 LTS release

The old UI code base will be removed from the Winter 2020 LTS release of CloudStack, and will not be available in releases from then onwards.

It is worth noting that, as the new Primate UI is a discrete client for CloudStack that uses API discovery, the UI will no longer be bound to the core CloudStack code. This may mean that, long term, the UI may adopt its own release cycle, independent of core CloudStack releases. This long-term release strategy is yet to be decided by the CloudStack community.

What CloudStack users need to do

As the old UI is being deprecated, organisations need to plan to migrate to the new CloudStack UI.

What actions specific organisations need to take depends on their use of the current UI. Many organisations only use the CloudStack UI for admin purposes, choosing other solutions to present to their end-users. It is expected that the amount of training required for admins to use the new UI will be minimal and therefore such organisations will not need to extensively plan the deployment of the new UI.

For organisations that do use the CloudStack UI to present to their users, more considered planning is suggested. Although the new UI gives a much enhanced and intuitive experience, it is anticipated that users may need documentation updates, etc. The new UI will need to be extensively tested with any 3rd-party integrations and UI customisations at users' sites. As the technology stack is completely new, it is likely that such integrations and customisations may need to be refactored.

A summary of support for the old / new UIs is below:

CloudStack version | Likely release date | Ships with old UI | Ships with new UI | LTS support until*
Winter 2019 LTS | Q1 2020 | Yes | Technical Preview | c. Sept 2021
Summer 2020 LTS | Q2/3 2020 | Yes (although it will contain no new features from the previous version) | Yes | c. Feb 2022
Winter 2020 LTS | Q1 2021 | No | Yes | c. Sept 2022

*LTS support cycle from the Apache CloudStack community. Providers of commercial support services (such as ShapeBlue) may have different cycles.

Anybody actively developing new functionality for CloudStack needs to be aware that changes to the old UI code will not be accepted after the Winter 2019 LTS release.

Get involved

Primate on an iPhone

As development of Project Primate is still ongoing, I encourage CloudStack users to download and run the Primate UI before release – it is not recommended to use the new UI in production environments until it is at GA. The code and install documentation can be found at https://github.com/apache/cloudstack-primate. This provides a unique opportunity to view the work to date, contribute ideas and test in your environment before the release date. Anybody wishing to join the SIG can do so on the dev@cloudstack.apache.org mailing list.

Introduction

In my previous post, I described the new ‘Open vSwitch with DPDK support’ on CloudStack for KVM hosts. There, I focused on describing the feature, as it was new to CloudStack, and also explained the necessary configuration on the KVM agents’ side to enable DPDK support.

DPDK (Data Plane Development Kit) (https://www.dpdk.org/) is a set of libraries and NIC drivers for fast packet processing in userspace. Using DPDK along with OVS brings benefits to networking performance on VMs and networking appliances. DPDK support in CloudStack requires that the KVM hypervisor is running on DPDK-compatible hardware.

In this post, I will describe the new functions which ShapeBlue has introduced for CloudStack 4.13 LTS. With these new features, DPDK support is extended, allowing administrators to:

  • Create service offerings with additional configurations. In particular, the additional configurations required for DPDK can be included in service offerings
  • Select the DPDK vHost User mode to use on each VM deployment, from these service offerings
  • Perform live migrations of DPDK-enabled VMs between DPDK-enabled hosts

CloudStack new additions for DPDK support

In the first place, it is necessary to mention that DPDK support works along with additional VM configurations. Please ensure that the global setting ‘enable.additional.vm.configuration’ is turned on.

As a reminder from the previous post, DPDK support is enabled on VMs with additional configuration details/keys:

  • ‘extraconfig-dpdk-numa’
  • ‘extraconfig-dpdk-hugepages’

One of the new additions for DPDK support is the ability for the administrator to select, via service offerings, the vHost user mode to be used for DPDK. The vHost user mode describes a client / server model between Open vSwitch with DPDK and QEMU, in which one acts as a client while the other acts as a server. The server creates and manages the vHost user sockets, and the client connects to the sockets created by the server.

Additional configurations on service offerings

CloudStack allows VM XML additional configurations and the DPDK vHost user mode to be stored on service offerings as details, and to be used on VM deployments from the service offering. Additional configurations and the DPDK vHost user mode for VM deployments must be passed as service offering details to the 'createServiceOffering' API by the administrator.

For example, the following format is valid:

(cloudmonkey)> create serviceoffering name=NAME displaytext=TEXT domainid=DOMAIN hosttags=TAGS
serviceofferingdetails[0].key=DPDK-VHOSTUSER serviceofferingdetails[0].value=server
serviceofferingdetails[1].key=extraconfig-dpdk-numa serviceofferingdetails[1].value=NUMACONF
serviceofferingdetails[2].key=extraconfig-dpdk-hugepages serviceofferingdetails[2].value=HUGEPAGESCONF

Please note:

  • Each additional configuration value must be URL UTF-8 encoded (NUMACONF and HUGEPAGESCONF in the example above).
  • The DPDK vHost user mode key must be: “DPDK-VHOSTUSER”, and its possible values are “client” and “server”. Its value is passed to the KVM hypervisors. If it is not passed, then “server” mode is assumed. Please note this value must not be encoded.
  • Additional configurations on VMs are additive to the additional configurations on service offerings.
  • In case one or more additional configurations have the same name (or key), the additional configurations on the VM take precedence over those on the service offering (see the deployment sketch below).
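
On the VM side, the same extra configuration keys can be passed at deployment time. A hedged sketch, assuming the 'extraconfig' parameter of deployVirtualMachine introduced alongside the additional-VM-configuration feature (the value must be URL-encoded, and the UUIDs are placeholders):

(cloudmonkey)> deploy virtualmachine zoneid=<zone-uuid> templateid=<template-uuid> \
    serviceofferingid=<dpdk-offering-uuid> extraconfig=<url-encoded-configuration>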

On VM deployment, the DPDK vHost user mode is passed to the KVM host. Based on its value:

  • When DPDK vHost user mode = “server”:
    • OVS with DPDK acts as the server, while QEMU acts as the client. This means that VM’s interfaces are created in ‘client’ mode.
    • The DPDK ports are created with type: ‘dpdkvhostuser’
  • When DPDK vHost user mode = “client”:
    • OVS with DPDK acts as the client, and QEMU acts as the server.
    • If Openvswitch is restarted, then the sockets can reconnect to the existing sockets on the server, and standard connectivity can be resumed.
    • The DPDK ports are created with type: ‘dpdkvhostuserclient’

Live migrations of DPDK-enabled VMs

Another useful function of DPDK support is live migration between DPDK-enabled hosts. This is made possible by introducing a new host capability on DPDK-enabled hosts (enablement was described in the previous post). CloudStack uses the DPDK host capability to determine which hosts are DPDK-enabled.

However, the management server also needs a mechanism to decide whether a VM is DPDK-enabled before allowing live migration to DPDK-enabled hosts. The decision is based on the following criteria:

  • The VM is running on a DPDK-enabled host.
  • The VM possesses the required DPDK configuration in its VM details or service offering details.

This allows administrators to live migrate these VMs to suitable hosts.

Conclusion

As the previous post describes, DPDK support was initially introduced in CloudStack 4.12. This blog post covers the DPDK support extension for CloudStack 4.13 LTS, introducing more flexibility and improving its usage. As CloudStack recently started supporting DPDK, more additions to its support are expected to be added in future versions.

Future work may involve UI support for the previously described features. Please note that it is currently not possible to pass additional configuration to VMs or service offerings using the CloudStack UI; it is only available through the API.

For references, please check PRs:

About the author

Nicolas Vazquez is a Senior Software Engineer at ShapeBlue, the Cloud Specialists, and is a committer in the Apache CloudStack project. Nicolas spends his time designing and implementing features in Apache CloudStack.

Introduction

CloudStack usage is a complementary service which tracks end user consumption of CloudStack resources and summarises this in a separate database for reporting or billing. The usage database can be queried directly or through the CloudStack API, or it can be integrated into external billing or reporting systems.

For background information on the usage service please refer to the CloudStack documentation set:

In this blog post we will go a step further and deep dive into how the usage service works, how you can run usage reports from the database either directly or through the API, and also how to troubleshoot this.

Please note – in this blog post we will be discussing the underlying database structure for the CloudStack management and usage services. Whilst these have separate databases they do in some cases share table names – hence please note the databases referenced throughout, e.g. cloud.usage_event versus cloud_usage.usage_event, etc.

Configuration

Installation

As per the official CloudStack documentation the usage service is simply installed and started. In CentOS/RHEL this is done as follows:

# yum install cloudstack-usage
# chkconfig cloudstack-usage on
# service cloudstack-usage start

whilst on a Debian/Ubuntu server:

# apt-get install cloudstack-usage
# update-rc.d cloudstack-usage defaults
# service cloudstack-usage start

Once configured, the usage service will use the same MySQL connection details as the main CloudStack management service. These are automatically added when the management service is configured with the “cloudstack-setup-databases” script (refer to http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/4.9/management-server/index.html). The usage service installation simply adds a symbolic link to the same db.properties file as is used by cloudstack-management:

# ls -l /etc/cloudstack/usage/
total 4
lrwxrwxrwx. 1 root root   40 Sep  8 08:18 db.properties -> /etc/cloudstack/management/db.properties
lrwxrwxrwx. 1 root root   30 Sep  8 08:18 key -> /etc/cloudstack/management/key
-rw-r--r--. 1 root root 2968 Jul 12 10:36 log4j-cloud.xml

Please note that whilst the cloudstack-usage and cloudstack-management services share the same db.properties configuration file, it still contains individual settings for each service:

# grep -i usage /etc/cloudstack/usage/db.properties
db.usage.maxActive=100
# usage database tuning parameters
db.usage.maxWait=10000
db.usage.maxIdle=30
db.usage.name=cloud_usage
db.usage.port=3306
# usage database settings
db.usage.failOverReadOnly=false
db.usage.host=(Usage DB host IP address)
db.usage.password=ENC(Encrypted password)
db.usage.initialTimeout=3600
db.usage.username=cloud
db.usage.autoReconnect=true
db.usage.url.params=
db.usage.driver=jdbc:mysql
#usage Database
db.usage.reconnectAtTxEnd=true
db.usage.queriesBeforeRetryMaster=5000
db.usage.slaves=localhost,localhost
db.usage.autoReconnectForPools=true
db.usage.secondsBeforeRetryMaster=3600

Note the above settings would need to be changed if:

  • the usage DB is installed on a different MySQL server than the main CloudStack database
  • the usage database uses a different set of login credentials

Also note that the passwords in the file above are encrypted using the method specified during the “cloudstack-setup-databases” script run – hence this also uses the referenced “key” file as shown in the above folder listing.

Application settings

Once installed, the usage service is configured with the following global settings in CloudStack:

  • enable.usage.server:
    • Switches usage service on/off
    • true|false
  • usage.aggregation.timezone:
    • Timezone used for usage aggregation.
    • Refer to http://docs.cloudstack.apache.org/en/latest/dev.html for formatting.
    • Defaults to “GMT”.
  • usage.execution.timezone:
    • Timezone for usage job execution.
    • Refer to http://docs.cloudstack.apache.org/en/latest/dev.html for formatting.
  • usage.sanity.check.interval:
    • Interval (in days) to check sanity of usage data.
  • usage.snapshot.virtualsize.select:
    • Set the value to true if snapshot usage needs to consider virtual size; otherwise physical size is considered.
    • true|false – defaults to false.
  • usage.stats.job.aggregation.range:
    • The range of time for aggregating the user statistics specified in minutes (e.g. 1440 for daily, 60 for hourly. Default is 60 minutes).
    • Please note this setting would be changed in a chargeback situation where VM resources are charged on an hourly/daily/monthly basis.
  • usage.stats.job.exec.time:
    • The time at which the usage statistics aggregation job will run as an HH:MM time, e.g. 00:30 to run at 12:30am.
    • Default is 00:15.
    • Please note this time follows the setting in usage.execution.timezone above.

Please note – if any of these settings are updated then only the cloudstack-usage service needs to be restarted (i.e. there is no need to restart cloudstack-management).
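
For example, switching to daily aggregation could be done via cloudmonkey (a sketch; the setting name is as listed above) followed by a usage service restart:

# cloudmonkey update configuration name=usage.stats.job.aggregation.range value=1440
# service cloudstack-usage restart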

Usage types

To track the resources utilised in CloudStack, every API call in which a resource is created, destroyed, stopped, started, requested or released is tracked in the cloud.usage_event table. This table has entries for every event since the CloudStack instance was created, hence it may grow quite large.

During processing every event in this table is assigned a usage type. The usage types are listed in the CloudStack documentation http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/4.9/usage.html#usage-types, or they can simply be queried using the CloudStack “listUsageTypes” API call:


# cloudmonkey list usagetypes
count = 19
usagetype:
+-------------+-----------------------------------------+
| usagetypeid | description                             |
+-------------+-----------------------------------------+
|  1          |  Running Vm Usage                       |
|  2          |  Allocated Vm Usage                     |
|  3          |  IP Address Usage                       |
|  4          |  Network Usage (Bytes Sent)             |
|  5          |  Network Usage (Bytes Received)         |
|  6          |  Volume Usage                           |
|  7          |  Template Usage                         |
|  8          |  ISO Usage                              |
|  9          |  Snapshot Usage                         |
| 10          |  Security Group Usage                   |
| 11          |  Load Balancer Usage                    |
| 12          |  Port Forwarding Usage                  |
| 13          |  Network Offering Usage                 |
| 14          |  VPN users usage                        |
| 21          |  VM Disk usage(I/O Read)                |
| 22          |  VM Disk usage(I/O Write)               |
| 23          |  VM Disk usage(Bytes Read)              |
| 24          |  VM Disk usage(Bytes Write)             |
| 25          |  VM Snapshot storage usage              |
+-------------+-----------------------------------------+

Please note these usage types are calculated depending on the nature of the resource used, e.g.:

  • “Running VM usage” will simply count the hours a single VM instance is used.
  • “Volume usage” will however track both the size of each volume in addition to the time utilised.

Process flow

Overview

From a high level point of view the usage service processes data already generated by the CloudStack management service, copies this to the cloud_usage database, and then processes and aggregates the data in the cloud_usage.cloud_usage table.


Details

Using a running VM instance as an example, the data process flow is as follows.

Usage_event table entries

CloudStack management writes all events to the cloud.usage_event table. This happens whether the cloudstack-usage service is running or not.

In this example we will track the VM with instance ID 17. The resource tracked – be it a VM, a volume, a port forwarding rule, etc. – is listed in the usage_event table as “resource_id”, which points to the main ID field in the vm_instance, volumes, etc. tables.


SELECT 
   * 
FROM 
   cloud.usage_event 
WHERE
   type like '%VM%' and resource_id=17;

id | type | account_id | created | zone_id | resource_id | resource_name | offering_id | template_id | size | resource_type | processed | virtual_size
68 | VM.CREATE | 6 | 2017-09-08 11:14:31 | 1 | 17 | bbannervm12 | 17 | 5 | NULL | XenServer | 0 | NULL
70 | VM.START | 6 | 2017-09-08 11:14:41 | 1 | 17 | bbannervm12 | 17 | 5 | NULL | XenServer | 0 | NULL
123 | VM.STOP | 6 | 2017-09-26 13:44:48 | 1 | 17 | bbannervm12 | 17 | 5 | NULL | XenServer | 0 | NULL
125 | VM.DESTROY | 6 | 2017-09-26 13:45:00 | 1 | 17 | bbannervm12 | 17 | 5 | NULL | XenServer | 0 | NULL

Please note: a lot of the resources will obviously still be in use – i.e. they will not have a destroy/release entry. In this case the usage service considers the end date to be open, i.e. all calculations are up until today.

Usage_event copy

When the usage job runs (at “usage.stats.job.exec.time”) it first copies all new entries since the last processing time from the cloud.usage_event table to the cloud_usage.usage_event table.

The only difference between the two tables is the “processed” column – in the cloud database this is always set to 0, whereas once a table entry has been processed in the cloud_usage database this field is updated to 1.
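
A quick way of checking whether the copy and processing steps are keeping up is to count rows by processed state (an illustrative query only):

SELECT
   processed, COUNT(*)
FROM
   cloud_usage.usage_event
GROUP BY
   processed;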

In comparison – the entries in the cloud database:


SELECT 
   * 
FROM
   cloud.usage_event 
WHERE
   id > 130;
id | type | account_id | created | zone_id | resource_id | resource_name | offering_id | template_id | size | resource_type | processed | virtual_size
131 | VOLUME.CREATE | 6 | 2017-09-26 13:45:44 | 1 | 31 | bbannerdata3 | 6 | NULL | 2147483648 | NULL | 0 | NULL
132 | NET.IPASSIGN | 6 | 2017-09-26 13:46:05 | 1 | 17 | 10.1.34.77 | NULL | 0 | 0 | VirtualNetwork | 0 | NULL
133 | VM.STOP | 8 | 2017-09-28 10:31:44 | 1 | 23 | secretprojectvm1 | 17 | 5 | NULL | XenServer | 0 | NULL
134 | NETWORK.OFFERING.REMOVE | 8 | 2017-09-28 10:31:44 | 1 | 23 | 4 | 18 | NULL | 0 | NULL | 0 | NULL

Compared to the same entries in cloud_usage:


SELECT 
   * 
FROM 
   cloud_usage.usage_event
WHERE 
   id > 130;
id | type | account_id | created | zone_id | resource_id | resource_name | offering_id | template_id | size | resource_type | processed | virtual_size
131 | VOLUME.CREATE | 6 | 2017-09-26 13:45:44 | 1 | 31 | bbannerdata3 | 6 | NULL | 2147483648 | NULL | 1 | NULL
132 | NET.IPASSIGN | 6 | 2017-09-26 13:46:05 | 1 | 17 | 10.1.34.77 | NULL | 0 | 0 | VirtualNetwork | 1 | NULL
133 | VM.STOP | 8 | 2017-09-28 10:31:44 | 1 | 23 | secretprojectvm1 | 17 | 5 | NULL | XenServer | 1 | NULL
134 | NETWORK.OFFERING.REMOVE | 8 | 2017-09-28 10:31:44 | 1 | 23 | 4 | 18 | NULL | 0 | NULL | 1 | NULL

Account copy

As part of this copy job the cloudstack-usage service will also make a copy of some of the columns in the cloud.account table, so that ownership of resources can be easily established during processing.

Usage summary and helper tables

In the first usage aggregation step all usage data per account and per usage type is summarised into helper tables. Continuing the example above, the CREATE+DESTROY events as well as the VM START+STOP events are summarised in the “usage_vm_instance” table:


SELECT
   *
FROM
   cloud_usage.usage_vm_instance 
WHERE
   vm_instance_id=17;

usage_type | zone_id | account_id | vm_instance_id | vm_name | service_offering_id | template_id | hypervisor_type | start_date | end_date | cpu_speed | cpu_cores | memory
1 | 1 | 6 | 17 | bbannervm12 | 17 | 5 | XenServer | 2017-09-08 11:14:41 | 2017-09-26 13:44:48 | NULL | NULL | NULL
2 | 1 | 6 | 17 | bbannervm12 | 17 | 5 | XenServer | 2017-09-08 11:14:31 | 2017-09-26 13:45:00 | NULL | NULL | NULL

Note the helper table has now summarised the data with the usage type mentioned above – and the start/end dates are contained in the same database row.

Please note – if a resource is still in use then the end date simply isn’t populated, i.e. all calculations will work on a rolling end date of today.
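
As a sketch of how such open-ended rows can be handled when querying the helper tables directly, the current time can be substituted for the missing end date using standard MySQL functions (illustrative query only):

SELECT
   vm_instance_id,
   usage_type,
   TIMESTAMPDIFF(SECOND, start_date, COALESCE(end_date, NOW())) / 3600 AS hours
FROM
   cloud_usage.usage_vm_instance
WHERE
   vm_instance_id = 17;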

If we now also compare the volume used by VM instance ID 17 we find this in the cloud_usage.usage_volume helper table:

SELECT
 usage_volume.*
FROM
 cloud_usage.usage_volume
LEFT JOIN
 cloud.volumes ON (usage_volume.id = volumes.id)
WHERE
 cloud.volumes.instance_id = 17;
id | zone_id | account_id | domain_id | disk_offering_id | template_id | size | created | deleted
18 | 1 | 6 | 2 | NULL | 5 | 21474836480 | 2017-09-08 11:14:31 | 2017-09-26 13:45:00

As the database selects above show, each helper table will contain only the information pertinent to that specific usage type: the cloud_usage.usage_vm_instance table contains information about the VM service offering, template and hypervisor type, whilst the cloud_usage.usage_volume table contains information about disk offering ID, template ID and size.

If a usage type for a resource has been started/stopped or requested/released multiple times then each period of use will be listed in the helper tables:


SELECT 
   * 
FROM
   cloud_usage.usage_vm_instance
WHERE
   vm_instance_id=12;

usage_type | zone_id | account_id | vm_instance_id | vm_name | service_offering_id | template_id | hypervisor_type | start_date | end_date | cpu_speed | cpu_cores | memory
1 | 1 | 6 | 12 | bbannervm2 | 17 | 5 | XenServer | 2017-09-08 09:30:37 | 2017-09-08 09:30:49 | NULL | NULL | NULL
1 | 1 | 6 | 12 | bbannervm2 | 17 | 5 | XenServer | 2017-09-08 11:14:03 | NULL | NULL | NULL | NULL
2 | 1 | 6 | 12 | bbannervm2 | 17 | 5 | XenServer | 2017-09-08 09:30:20 | NULL | NULL | NULL | NULL

Usage data aggregation

Once all helper tables have been populated the usage service creates time-aggregated database entries in the cloud_usage.cloud_usage table. Put simply, this process:

  1. Analyses all entries in the helper tables.
  2. Splits up this data based on “usage.stats.job.aggregation.range” to create individual usage timeblocks.
  3. Repeats this process for all accounts and for all resources.

So – looking at the VM with ID=17 analysed above:

  • This had a running start date of 2017-09-08 11:14:41 and an end date of 2017-09-26 13:44:48.
  • The usage service is set up with usage.stats.job.aggregation.range=1440, i.e. 24 hours.
  • The usage service will now create entries in the cloud_usage.cloud_usage table for every full and partial 24 hour period this VM was running (a quick sanity check follows this list).
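
As that sanity check: from 11:14:41 to midnight is 12 hours, 45 minutes and 18 seconds, i.e. 12 + 45/60 + 18/3600 ≈ 12.7553 hours – which matches the raw_usage of the first row below.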

SELECT 
   *
FROM
   cloud_usage.cloud_usage
WHERE 
   usage_id=17 and usage_type=1;

id | zone_id | account_id | domain_id | description | usage_display | usage_type | raw_usage | vm_instance_id | vm_name | offering_id | template_id | usage_id | type | size | network_id | start_date | end_date | virtual_size | cpu_speed | cpu_cores | memory | quota_calculated
64 | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 12.755278 Hrs | 1 | 12.755277633666992 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-08 00:00:00 | 2017-09-08 23:59:59 | NULL | NULL | NULL | NULL | 0
146 | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 24 Hrs | 1 | 24 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-09 00:00:00 | 2017-09-09 23:59:59 | NULL | NULL | NULL | NULL | 0
221 | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 24 Hrs | 1 | 24 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-10 00:00:00 | 2017-09-10 23:59:59 | NULL | NULL | NULL | NULL | 0
...
1271 | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 24 Hrs | 1 | 24 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-24 00:00:00 | 2017-09-24 23:59:59 | NULL | NULL | NULL | NULL | 0
1346 | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 24 Hrs | 1 | 24 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-25 00:00:00 | 2017-09-25 23:59:59 | NULL | NULL | NULL | NULL | 0
1427 | 1 | 6 | 2 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | 13.746667 Hrs | 1 | 13.74666690826416 | 17 | bbannervm12 | 17 | 5 | 17 | XenServer | NULL | NULL | 2017-09-26 00:00:00 | 2017-09-26 23:59:59 | NULL | NULL | NULL | NULL | 0

Since all of these entries are split into specific dates it is now relatively straightforward to run a report to capture all resource usage for an account over a specific time period, e.g. if a monthly bill is required.

Querying usage data through the API

The usage records can also be queried through the API by using the “listUsageRecords” API call. This uses similar syntax to the above – but there are some differences:

  • The API call requires start and end dates; these are in a “yyyy-MM-dd HH:mm:ss” or simply a “yyyy-MM-dd” format.
  • The usage type is the same as above, e.g. type=1 for running VMs.
  • The usage ID is however the UUID attached to the resource in question – e.g. in the following example VM ID 17 actually has UUID 4358f436-bc9b-4793-b1be-95fa9b074fd5 in the vm_instance table (see the lookup sketch after this list).
  • The API call can also be filtered by account/accountid/domain.

More information on the syntax can be found in http://cloudstack.apache.org/api/apidocs-4.9/apis/listUsageRecords.html .
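
To translate between the internal database ID and this UUID, a simple lookup against the cloud database can be used (illustrative query only):

SELECT
   id, uuid, name
FROM
   cloud.vm_instance
WHERE
   id = 17;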

The following API query will list the first three days’ worth of usage data listed in the table above:


# cloudmonkey list usagerecords type=1 startdate=2017-09-09 enddate=2017-09-10 usageid=4358f436-bc9b-4793-b1be-95fa9b074fd5
count = 3
usagerecord:
+-----------------------------+---------+--------------------------------------+-----------------------------+--------------------------------------------------------------+-------------+--------------------------------------+--------------------------------------+-----------+------------+--------------------------------------+----------+--------------------------------------+---------------+--------------------------------------+-----------+--------------------------------------+
| startdate                   | account | domainid                             | enddate                     | description                                                  | name        | virtualmachineid                     | offeringid                           | usagetype | domain     | zoneid                               | rawusage | templateid                           | usage         | usageid                              | type      | accountid                            |
+-----------------------------+---------+--------------------------------------+-----------------------------+--------------------------------------------------------------+-------------+--------------------------------------+--------------------------------------+-----------+------------+--------------------------------------+----------+--------------------------------------+---------------+--------------------------------------+-----------+--------------------------------------+
| 2017-09-08'T'00:00:00+00:00 | bbanner | f3501b29-01f7-44ce-a266-9e3f12c17394 | 2017-09-08'T'23:59:59+00:00 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | bbannervm12 | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | 60d9aaf1-7ff7-472e-b29f-6768d0cb5702 | 1         | Subdomain1 | d4b9d32e-d779-48b8-814d-d7847d55a684 | 12.755278| 47dd8c98-946e-11e7-b419-0666ae010714 | 12.755278 Hrs | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | XenServer | 8c2d592f-78e1-4e92-a910-1e4b865240cf |
| 2017-09-09'T'00:00:00+00:00 | bbanner | f3501b29-01f7-44ce-a266-9e3f12c17394 | 2017-09-09'T'23:59:59+00:00 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | bbannervm12 | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | 60d9aaf1-7ff7-472e-b29f-6768d0cb5702 | 1         | Subdomain1 | d4b9d32e-d779-48b8-814d-d7847d55a684 | 24       | 47dd8c98-946e-11e7-b419-0666ae010714 | 24 Hrs        | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | XenServer | 8c2d592f-78e1-4e92-a910-1e4b865240cf |
| 2017-09-10'T'00:00:00+00:00 | bbanner | f3501b29-01f7-44ce-a266-9e3f12c17394 | 2017-09-10'T'23:59:59+00:00 | bbannervm12 running time (ServiceOffering: 17) (Template: 5) | bbannervm12 | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | 60d9aaf1-7ff7-472e-b29f-6768d0cb5702 | 1         | Subdomain1 | d4b9d32e-d779-48b8-814d-d7847d55a684 | 24       | 47dd8c98-946e-11e7-b419-0666ae010714 | 24 Hrs        | 4358f436-bc9b-4793-b1be-95fa9b074fd5 | XenServer | 8c2d592f-78e1-4e92-a910-1e4b865240cf |
+-----------------------------+---------+--------------------------------------+-----------------------------+--------------------------------------------------------------+-------------+--------------------------------------+--------------------------------------+-----------+------------+--------------------------------------+----------+--------------------------------------+---------------+--------------------------------------+-----------+--------------------------------------+

Analysing and reporting on usage data

The usage data can be analysed in any reporting tool – from the various CloudStack billing platforms, to enterprise billing systems as well as simpler tools like Excel. Since the cloud_usage.cloud_usage data is fully aggregated into time utilised blocks, it is now just a question of summarising data based on usage type, accounts, service offerings, etc.

The following SQL queries are provided as examples only – in a real use case these will most likely need to be adapted and refined to the specific reporting requirements.

Running VMs

To find usage data for all VMs running during the month of September we search for usage type=1 and group by vm_instance_id. For each VM instance we summarise how many hours the VM has been running – however, in a real billing scenario this would most likely also be broken down into e.g. how many hours of VM usage have been utilised per VM service offering.


SELECT
   account_id,
   account_name,
   usage_type,
   offering_id,
   vm_instance_id,
   vm_name,
   SUM(raw_usage) as VMRunHours
FROM
   cloud_usage.cloud_usage
LEFT JOIN 
   cloud_usage.account on (cloud_usage.account_id = account.id)
WHERE 
   start_date LIKE '2017-09%' 
   AND usage_type = 1
GROUP BY 
   vm_instance_id
ORDER BY 
   account_id ASC, vm_instance_id ASC;
account_id | account_name | usage_type | offering_id | vm_instance_id | vm_name | VMRunHours
2 | admin | 1 | 1 | 3 | rootvm1 | 3.0205559730529785
2 | admin | 1 | 17 | 20 | rootvm2 | 539.7991666793823
4 | pparker | 1 | 17 | 5 | pparkervm1 | 542.5497226715088
4 | pparker | 1 | 17 | 14 | pparkervm5 | 0.26527804136276245
4 | pparker | 1 | 17 | 15 | pparkervm7 | 0.2247224897146225
4 | pparker | 1 | 17 | 16 | pparkervm16 | 540.774167060852
4 | pparker | 1 | 17 | 22 | ppvpcvm1000 | 539.7311105728149
5 | ckent | 1 | 17 | 7 | ckentvm1 | 5.246944904327393
5 | ckent | 1 | 17 | 9 | ckentvm2 | 435.4169445037842
5 | ckent | 1 | 17 | 18 | ckentvm23 | 0.8186113834381104
5 | ckent | 1 | 17 | 25 | ckentvm30 | 106.28194522857666
6 | bbanner | 1 | 17 | 10 | bbannervm1 | 1.7469446659088135
6 | bbanner | 1 | 17 | 12 | bbannervm2 | 540.7691669464111
6 | bbanner | 1 | 17 | 17 | bbannervm12 | 434.50194454193115
6 | bbanner | 1 | 17 | 26 | bbannervm30 | 106.24055576324463
8 | PrjAcct-SecretProject-1 | 1 | 17 | 23 | secretprojectvm1 | 477.4819440841675

Network utilisation

The following will summarise network usage for sent (usage type=4) and received (usage type=5) traffic on a per account basis, again this is listing for the month of September.

For network utilisation the usage is simply summarised as total Bytes sent or received:


SELECT
   account_id,
   account_name,
   usage_type,
   network_id,
   SUM(raw_usage) as TotalBytes
FROM
   cloud_usage.cloud_usage
LEFT JOIN 
   cloud_usage.account on (cloud_usage.account_id = account.id)
WHERE 
   start_date LIKE '2017-09%' 
   AND usage_type in (4,5)
GROUP BY 
   account_id, usage_type
ORDER BY 
   account_id ASC;
account_id | account_name | usage_type | network_id | TotalBytes
2 | admin | 4 | 204 | 391320
2 | admin | 5 | 204 | 1744
4 | pparker | 4 | 200 | 164764260
4 | pparker | 5 | 200 | 163779643
5 | ckent | 4 | 206 | 391500
5 | ckent | 5 | 206 | 0
6 | bbanner | 4 | 207 | 776700
6 | bbanner | 5 | 207 | 0
8 | PrjAcct-SecretProject-1 | 4 | 211 | 343080
8 | PrjAcct-SecretProject-1 | 5 | 211 | 0

Volume utilisation

For volume or general storage utilisation (this applies to snapshots as well) the usage is calculated as storage hours – e.g. GbHours. In this example we again summarise for all volumes (usage type=6) on a per account and per disk basis during the month of September. Please note in this case we have to do multiple joins (or nested WHERE statements) to look up volume IDs, VM names, etc.


SELECT
   cloud_usage.cloud_usage.account_id,
   cloud_usage.account.account_name,
   cloud_usage.cloud_usage.usage_type,
   cloud_usage.cloud_usage.usage_id,
   cloud.vm_instance.name as Instance_Name,
   cloud.volumes.name as Volume_Name,
   cloud_usage.cloud_usage.size/(1024*1024*1024) as DiskSizeGb,
   SUM(cloud_usage.cloud_usage.raw_usage) as TotalHours,
   sum(cloud_usage.cloud_usage.raw_usage*cloud_usage.cloud_usage.size/(1024*1024*1024)) as GbHours
FROM
   cloud_usage.cloud_usage
LEFT JOIN 
   cloud_usage.account on (cloud_usage.account_id = account.id)
LEFT JOIN
   cloud.volumes on (cloud_usage.usage_id = volumes.id)
LEFT JOIN 
   cloud.vm_instance on (cloud.volumes.instance_id = cloud.vm_instance.id)
WHERE 
   start_date LIKE '2017-09%' AND usage_type = 6
GROUP BY 
   usage_id
ORDER BY 
   account_id ASC, usage_id ASC;

account_id | account_name | usage_type | usage_id | Instance_Name | Volume_Name | DiskSizeGb | TotalHours | GbHours
2 | admin | 6 | 3 | rootvm1 | ROOT-3 | 20.0000 | 542.8836107254028 | 10857.672214508057
2 | admin | 6 | 23 | rootvm2 | ROOT-20 | 20.0000 | 539.8033332824707 | 10796.066665649414
4 | pparker | 6 | 5 | pparkervm1 | ROOT-5 | 20.0000 | 542.6494445800781 | 10852.988891601562
4 | pparker | 6 | 15 | pparkervm5 | ROOT-14 | 20.0000 | 541.0441675186157 | 10820.883350372314
4 | pparker | 6 | 16 | pparkervm7 | ROOT-15 | 20.0000 | 0.2291669398546219 | 4.583338797092438
4 | pparker | 6 | 17 | pparkervm16 | ROOT-16 | 20.0000 | 540.7772226333618 | 10815.544452667236
4 | pparker | 6 | 25 | ppvpcvm1000 | ROOT-22 | 20.0000 | 539.7355556488037 | 10794.711112976074
5 | ckent | 6 | 7 | ckentvm1 | ROOT-7 | 20.0000 | 436.3361120223999 | 8726.722240447998
5 | ckent | 6 | 9 | ckentvm2 | ROOT-9 | 20.0000 | 542.5586109161377 | 10851.172218322754
5 | ckent | 6 | 20 | ckentvm23 | ROOT-18 | 20.0000 | 434.36277770996094 | 8687.255554199219
5 | ckent | 6 | 22 | NULL | ckentdata1 | 2.0000 | 540.651388168335 | 1081.30277633667
5 | ckent | 6 | 29 | ckentvm30 | ROOT-25 | 20.0000 | 106.28638935089111 | 2125.7277870178223
6 | bbanner | 6 | 10 | bbannervm1 | ROOT-10 | 20.0000 | 1.771389126777649 | 35.42778253555298
6 | bbanner | 6 | 12 | bbannervm2 | ROOT-12 | 20.0000 | 542.4944448471069 | 10849.888896942139
6 | bbanner | 6 | 13 | bbannervm2 | bbannerdatadisk1 | 2.0000 | 542.305832862854 | 1084.611665725708
6 | bbanner | 6 | 18 | bbannervm12 | ROOT-17 | 20.0000 | 434.5080556869507 | 8690.161113739014
6 | bbanner | 6 | 19 | bbannervm2 | bbannerdata2 | 5.0000 | 540.7536115646362 | 2703.768057823181
6 | bbanner | 6 | 30 | bbannervm30 | ROOT-26 | 20.0000 | 106.24472236633301 | 2124.89444732666
6 | bbanner | 6 | 31 | bbannervm30 | bbannerdata3 | 2.0000 | 106.23777770996094 | 212.47555541992188
8 | PrjAcct-SecretProject-1 | 6 | 26 | secretprojectvm1 | ROOT-23 | 20.0000 | 538.975832939148 | 10779.516658782959
8 | PrjAcct-SecretProject-1 | 6 | 28 | secretprojectvm1 | secretprojectdata1 | 2.0000 | 538.7525005340576 | 1077.5050010681152


IP addresses, port forwarding rules and VPN users

For other usage types where – similar to VM running hours – we simply report on the total hours utilised, we again summarise the raw_usage; but since the description in cloud_usage.cloud_usage is clear enough we don’t need to go looking elsewhere for this information. In the following example we report on IP address usage (usage type=3), port forwarding rules (12) and VPN users (14):


SELECT
   cloud_usage.cloud_usage.account_id,
   cloud_usage.account.account_name,
   cloud_usage.cloud_usage.usage_type,
   cloud_usage.cloud_usage.usage_id,
   cloud_usage.cloud_usage.description,
   SUM(cloud_usage.cloud_usage.raw_usage) as TotalHours
FROM
   cloud_usage.cloud_usage
LEFT JOIN
   cloud_usage.account on (cloud_usage.account_id = account.id)
WHERE
   start_date LIKE '2017-09%' AND usage_type in (3,12,14)
GROUP BY 
   description
ORDER BY
   account_id ASC, usage_id ASC;


account_id | account_name | usage_type | usage_id | description | TotalHours
2 | admin | 3 | 3 | IPAddress: 10.1.34.63 | 542.8833332061768
4 | pparker | 3 | 4 | IPAddress: 10.1.34.64 | 542.648889541626
4 | pparker | 3 | 13 | IPAddress: 10.1.34.73 | 539.7686109542847
5 | ckent | 3 | 5 | IPAddress: 10.1.34.65 | 542.6322221755981
5 | ckent | 3 | 6 | IPAddress: 10.1.34.66 | 542.5547218322754
5 | ckent | 3 | 7 | IPAddress: 10.1.34.67 | 542.5541667938232
5 | ckent | 3 | 10 | IPAddress: 10.1.34.70 | 540.6561107635498
5 | ckent | 3 | 11 | IPAddress: 10.1.34.71 | 540.2247219085693
5 | ckent | 3 | 12 | IPAddress: 10.1.34.72 | 540.0552778244019
5 | ckent | 3 | 16 | IPAddress: 10.1.34.76 | 106.27805614471436
6 | bbanner | 14 | 1 | VPN User: bbannervpn1, Id: 1 usage time | 542.4766664505005
6 | bbanner | 14 | 2 | VPN User: brucesdogvpn1, Id: 2 usage time | 1.7355557680130005
6 | bbanner | 14 | 3 | VPN User: bruceswifevpn1, Id: 3 usage time | 540.7405557632446
6 | bbanner | 14 | 4 | VPN User: stanleevpn1, Id: 4 usage time | 540.7180547714233
6 | bbanner | 3 | 8 | IPAddress: 10.1.34.68 | 542.529444694519
6 | bbanner | 12 | 9 | Port Forwarding Rule: 9 usage time | 1.6469446420669556
6 | bbanner | 3 | 9 | IPAddress: 10.1.34.69 | 542.4852771759033
6 | bbanner | 3 | 17 | IPAddress: 10.1.34.77 | 106.2319450378418
8 | PrjAcct-SecretProject-1 | 3 | 14 | IPAddress: 10.1.34.74 | 538.9755554199219
8 | PrjAcct-SecretProject-1 | 3 | 15 | IPAddress: 10.1.34.75 | 538.7594442367554

Troubleshooting

Service management

As described earlier in this blog post the usage job will run at a time specified in the usage.stats.job.exec.time global setting.

Once the job has run it will update its own internal database with the run time and the start/end times processed:


SELECT * FROM cloud_usage.usage_job;


id | host | pid | job_type | scheduled | start_millis | end_millis | exec_time | start_date | end_date | success | heartbeat
1 | acshostname/192.168.10.10 | 23589 | 0 | 0 | 1504828800000 | 1504915199999 | 2072 | 2017-09-08 00:00:00 | 2017-09-08 23:59:59 | 1 | 2017-09-09 00:14:53
2 | acshostname/192.168.10.10 | 23589 | 0 | 0 | 1504915200000 | 1505001599999 | 607 | 2017-09-09 00:00:00 | 2017-09-09 23:59:59 | 1 | 2017-09-10 00:14:53
3 | acshostname/192.168.10.10 | 23589 | 0 | 0 | 1505001600000 | 1505087999999 | 536 | 2017-09-10 00:00:00 | 2017-09-10 23:59:59 | 1 | 2017-09-11 00:14:53
4 | acshostname/192.168.10.10 | 23589 | 0 | 0 | 1505088000000 | 1505174399999 | 503 | 2017-09-11 00:00:00 | 2017-09-11 23:59:59 | 1 | 2017-09-12 00:14:53
5 | acshostname/192.168.10.10 | 23589 | 0 | 0 | 1505174400000 | 1505260799999 | 509 | 2017-09-12 00:00:00 | 2017-09-12 23:59:59 | 1 | 2017-09-13 00:14:53

A couple of things to note on this list:

  • Start_millis and end_millis simply list the epoch timestamps (in milliseconds) corresponding to start_date and end_date – see the quick check after this list. The epoch time is used by the usage service to determine cloud_usage.cloud_usage entries.
  • Exec_time lists how long the usage job ran for. This is useful in cases where the usage job processing time is longer than 24 hours – i.e. where usage job schedules may start overlapping.
  • The success field is set to 1 for success, 0 for failure.
  • Heartbeat lists when the job was run.
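
As a quick check, dividing the millisecond values by 1000 gives standard epoch seconds, which convert straight back to the job window – e.g. for the first row above:

# date -u -d @1504828800
Fri Sep  8 00:00:00 UTC 2017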

When the cloudstack-usage service is restarted it will run checks against the usage_job table to determine:

  • If the last scheduled job was run. If it wasn’t, the job is run again, i.e. a service startup will run a single missed job.
  • Thereafter the usage job will run at its normal scheduled time.

Usage troubleshooting – general advice

Since this blog post covers topics around adding/updating/removing entries in the cloud and cloud_usage databases, we always advise CloudStack users to take MySQL dumps of both databases before doing any work – whether this is done directly in MySQL or via the usage API calls.

Database inconsistencies

Under certain circumstances (e.g. if the cloudstack-management service crashes) the cloud.usage_event table may have inconsistent entries, e.g.:

  • STOP entries without a START entry, or DESTROY entries without a CREATE.
  • Double entries – i.e. a VM has two START entries.

The usage logs will show where these failures occur. The fix for these issues is to add/delete entries as required in the cloud.usage_event table – e.g. add a VM.START entry with the appropriate date stamp if one is missing, and so on.
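
As a minimal sketch of such a repair – assuming the VM.START entry for the VM in our earlier example had gone missing – a replacement row can be inserted manually (the column values here mirror the usage_event rows shown earlier; adapt them to the actual resource):

INSERT INTO cloud.usage_event
   (type, account_id, created, zone_id, resource_id, resource_name, offering_id, template_id, resource_type, processed)
VALUES
   ('VM.START', 6, '2017-09-08 11:14:41', 1, 17, 'bbannervm12', 17, 5, 'XenServer', 0);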

Usage service logs

The usage service writes all logs to /var/log/cloudstack/usage/usage.log. These logs are relatively verbose and will outline all actions performed during the usage job:


DEBUG [usage.parser.IPAddressUsageParser] (Usage-Job-1:null) (logid:) Parsing IP Address usage for account: 2
DEBUG [usage.parser.IPAddressUsageParser] (Usage-Job-1:null) (logid:) Total usage time 86400000ms
DEBUG [usage.parser.IPAddressUsageParser] (Usage-Job-1:null) (logid:) Creating IP usage record with id: 3, usage: 24, startDate: Tue Oct 10 00:00:00 UTC 2017, endDate: Tue Oct 10 23:59:59 UTC 2017, for account: 2
DEBUG [usage.parser.VPNUserUsageParser] (Usage-Job-1:null) (logid:) Parsing all VPN user usage events for account: 2
DEBUG [usage.parser.VPNUserUsageParser] (Usage-Job-1:null) (logid:) No VPN user usage events for this period
DEBUG [usage.parser.VMSnapshotUsageParser] (Usage-Job-1:null) (logid:) Parsing all VmSnapshot volume usage events for account: 2
DEBUG [usage.parser.VMSnapshotUsageParser] (Usage-Job-1:null) (logid:) No VM snapshot usage events for this period
DEBUG [usage.parser.VMInstanceUsageParser] (Usage-Job-1:null) (logid:) Parsing all VMInstance usage events for account: 3
DEBUG [usage.parser.NetworkUsageParser] (Usage-Job-1:null) (logid:) Parsing all Network usage events for account: 3
DEBUG [usage.parser.VmDiskUsageParser] (Usage-Job-1:null) (logid:) Parsing all Vm Disk usage events for account: 3

Housekeeping of cloud_usage table

To carry out housekeeping of the cloud_usage.cloud_usage table the “removeRawUsageRecords” API call can be used to delete all usage entries older than a certain number of days. Note – since the cloud_usage table only contains fully parsed entries, deleting anything from this table will not lead to inconsistencies – it will just cut down on the number of usage records being reported on.

More information can be found in http://cloudstack.apache.org/api/apidocs-4.9/apis/removeRawUsageRecords.html.

The following example deletes all usage records older than 5 days:


# cloudmonkey removeRawUsageRecords interval=5
success = true

Regenerating usage data

The CloudStack API also has a call for regenerating usage records – generateUsageRecords. This can be utilised to rerun the usage job in case of job failure. More information can be found in the CloudStack documentation – http://cloudstack.apache.org/api/apidocs-4.9/apis/generateUsageRecords.html.

Please note the comment on the above documentation page: “This will generate records only if there any records to be generated, i.e. if the scheduled usage job was not run or failed”. In other words, this API call should not be made ad-hoc apart from in this specific situation.


# cloudmonkey generateUsageRecords startdate=2017-09-01 enddate=2017-09-30
success = true

Quota service

Anyone looking through the cloud_usage database will notice a number of quota_* tables. These are not directly linked to the usage service itself; rather, they are consumed by the Quota service. This service was created to monitor usage of CloudStack resources based on a per-account credit limit and a per-resource credit cost.

For more information on the Quota service please refer to the official CloudStack documentation / CloudStack wiki:

Conclusion

The CloudStack usage service can seem complicated for someone just getting started with it. We hope this blog post has managed to explain the background processes and how to get useful data out of the service.

We always value feedback – so if you have any comments or questions around this blog post please feel free to get in touch with the ShapeBlue team.

About The Author

Dag Sonstebo is a Cloud Architect at ShapeBlue, The Cloud Specialists. Dag spends his time designing, implementing and automating IaaS solutions based around Apache CloudStack.

Introduction

The CloudStack management server listens by default on port 8250 for agents, and this is secured by one-way SSL authentication using the management server’s self-generated server certificates. While this encrypts the connection, it does not authenticate and validate the connecting agent (client). Upcoming features such as support for container/application cluster services require certificate management, and the emerging common theme is that CloudStack needs an internal certificate authority (CA) that can provide and ensure security and authenticity of client-server connections, and issue, revoke and provision certificates.

Solution

To solve these problems, we designed and implemented a new pluggable CA framework with a default self-signed root CA provider plugin, that makes CloudStack a root CA. Initial support is available for securing KVM hosts and systemvm agents, along with communication between multiple management servers. The feature also provides new APIs for issuance, revocation, use, and provision of certificates. For more details, here is the functional specification of the feature.

The new CA framework and root CA provider plugin for CloudStack was accepted by the community recently, and will be available in CloudStack 4.11 (to be released in the near future).

How does it work?

The CA framework injects itself into CloudStack’s server and client components, and provides separation of independent policy enforcement and mechanism implementation. Various APIs, such as those for issuance, revocation, and provision of certificates, plug into the mechanism implementation provided by a CA provider plugin. In addition, the feature supports the automatic renewal of expiring certificates on an agent or a host, and will alert admins if auto-renewal is disabled or something goes wrong.

The feature ships with a built-in default root CA provider plugin that acts as a self-signed root CA authority, and issues certificates signed by its self-generated and signed CA certificate. It also allows developers to write their own CA provider plugin. If the configured CA provider plugin supports sharing of its CA certificate, a button will appear on the UI to download the CA certificate that can be imported to one’s browser, host, etc.

OK, what happens after we upgrade?

After upgrading CloudStack to a version which has this feature (e.g. 4.11), there will be no visible change and no additional steps are required. The root CA provider plugin will be configured and used by default and the global setting ca.plugin.root.auth.strictness will be set to false to mimic the legacy behaviour of one-way SSL authentication during handshake.

Post-upgrade, the CA framework will set up additional security (by means of keystore and certificates) on new KVM hosts and SystemVMs. If CloudStack admins want to enforce stricter security, they can upgrade and onboard all existing KVM and SystemVM agents, use the provisionCertificate API, set the global setting ca.plugin.root.auth.strictness to true (new CloudStack installations will have this setting set to true by default), and finally restart the management server(s). The SystemVM agents and (KVM) hosts will be in Up and connected state once two-way SSL handshake has correctly verified and authenticated the client-server connections.
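
As an illustrative cloudmonkey session for that stricter onboarding (the host UUID is a placeholder), provisioning a certificate and then enabling strict authentication could look like:

# cloudmonkey provision certificate hostid=<host-uuid>
# cloudmonkey update configuration name=ca.plugin.root.auth.strictness value=true

followed by a restart of the management server(s) as described above.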

Here’s a link to the official CloudStack Admin Documentation for more details.

About the author

Rohit Yadav is a Software Architect at ShapeBlue, the Cloud Specialists, and is a committer and PMC member of Apache CloudStack. Rohit spends most of his time designing and implementing features in Apache CloudStack.

Intro

What is HA?

“High availability is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. ”  — Wikipedia

HA in CloudStack is currently supported for VMs only. In order to have it enabled, the service offering of the VM should be HA-enabled, otherwise the VM will not be taken into consideration. There is no HA activity around hosts at this stage, so we don’t have a defence mechanism if a host goes down. All investigations are VM-centric and we’re unable to determine the health of the host or whether it is actually still running the VM. This may result in the VM-HA mechanism starting the same VM on a different host while the faulty host is still running it, which would result in corrupt VMs and disks. Such issues have been seen in large scale KVM deployments.

Host-HA

The Solution

Such issues motivated us to figure out a long-term solution to this problem, and we identified that the root of all evil is a lack of a reliable fencing and recovery mechanism. A new investigation model had to be introduced in order to achieve this, simply because the VM-centric one wasn’t going to be sufficient. Of course, it also needed to be easy for administrators to maintain.

Setting this as our destination point, we started defining our route to get there. The first thing that became obvious to us is that CloudStack was missing an OOBM (out-of-band management) tool to fence and recover hosts. OOBM is the ability to execute power cycle operations on a certain host. So – we developed the CloudStack OOBM plugin, which implements an industry-standard IPMI 2.0 provider, supported by most vendors. This way, when it is enabled per host, users are able to issue power commands such as: On, Off, Reset, etc. (an illustrative session follows the specification link below).
OOBM Feature Specification
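
As an illustrative cloudmonkey session (the host UUID and IPMI details are placeholders), configuring OOBM on a host, enabling it and issuing a power command could look like:

# cloudmonkey configure outofbandmanagement hostid=<host-uuid> driver=ipmitool address=10.1.1.10 port=623 username=admin password=secret
# cloudmonkey enable outofbandmanagementforhost hostid=<host-uuid>
# cloudmonkey issue outofbandmanagementpoweraction hostid=<host-uuid> action=STATUS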

Host-HA granular configuration: offers admins the ability to set explicit configuration at host/cluster/zone level. This way, in a large environment, some hosts in a cluster can be HA-enabled and some not, depending on the setup and the specific hardware in use.

Threshold based investigator: the admin can set a specific threshold of failed investigations; only when it is exceeded will the host transition to a different state.

More accurate investigating: Host-HA uses both health checks and activity checks to make decisions on recovering and fencing actions. Once it has determined that the resource (host) is in a faulty state (health checks failed), it runs activity checks to figure out whether there is any disk activity from the VMs running on that host. A sketch of configuring this per host follows.
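
As a sketch, enabling Host-HA on a single KVM host through the API could look like the following (the host UUID is a placeholder; listHostHAProviders returns the provider names valid for each host type):

# cloudmonkey configure haforhost hostid=<host-uuid> provider=kvmhaprovider
# cloudmonkey enable haforhost hostid=<host-uuid>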

Host-HA Design

Host-HA design aims to offer a way to separate policy from mechanism, where individuals are free to use different sets of pluggable tools (HA providers and OOBM tools) while having the same policy applied. Administrators can set the thresholds in global settings and not worry about the mechanism which is going to enforce them. With the resource management service, CloudStack admins can manage lifecycle operations per resource and use a kill switch on zone/cluster/host level to disable HA policy enforcement. The framework itself is resource type agnostic and can be extended to any other resource within CloudStack, like for instance load balancers.

HA providers are resource specific and are responsible for executing the HA framework and enforcing the applied policy. For example, the KVM HA provider, delivered as part of this feature, works with KVM hosts and carries out the HA-related activities.
A state machine implements the event triggers and transitions of a specific HA resource, based on which the framework takes the required actions to bring it to the right physical state. For example, if a host passes the threshold for being in the Degraded state, the framework will try to recover it by issuing an OOBM restart task, which will reset the host’s power so that the host eventually comes back up. Here’s a list of the states:

Available – the feature is Enabled and Host-HA is available
Suspect – there are health checks failing with the Host
Checking – activity checks are being performed
Degraded – the host is passing the activity check ratio and still providing service to the end user, but cannot be managed by CloudStack Management
Recovering – the Host-HA framework is trying to Recover the host by issuing OOBM job
Recovered – the Host-HA framework has recovered the Host successfully
Fencing – the Host-HA framework is trying to Fence the host by issuing OOBM job
Fenced – the Host-HA framework has fenced the Host successfully
Disabled –  feature is Disabled for the Host
Ineligible – feature is Enabled, but the host cannot be managed successfully by the Host-HA framework (possibly because OOBM is not configured properly)

Please refer to the image of the FSM transitions, where all possible transitions are defined along with the conditions required to move on to the next state.

Host-HA on KVM host

Host-HA on KVM hosts is provided by the KVM HA provider. It uses the STONITH (Shoot The Other Node In The Head) fencing model, and it provides a mechanism for activity checks on disks residing on shared NFS storage. How does it work? While in a cluster, neighbouring hosts are able to perform activity checks on the disks of VMs running on a faulty (health checks failed) host. The activity check verifies whether there is any actual activity on a VM disk while the host it runs on has been reported in bad health. If there is activity, the host stays in the Degraded state; if there is not, the HA framework transitions it to the Recovering state and tries to bring it back up. If the host exceeds the threshold for recovery attempts, the framework fences it by powering off the machine.

Please check out the FS for more technical details.

Find the pull request on the Apache CloudStack Public Repo

HOST-HA and VM-HA coordination

For KVM Host-HA to work effectively it has to work in tandem with the existing VM-HA framework. The current CloudStack implementation focuses on VM-HA, as VMs are the first class entities, while a host is considered to be a resource. CloudStack manages host states, and a rough mapping of the CloudStack host states vs. the KVM Host-HA states is as below:

VM-HA host states | KVM Host-HA host states
Up | Available
Up (Investigating) | Suspect / Checking
Alert | Degraded
Disconnected | Recovering / Recovered / Fencing
Down | Fenced
– | Ineligible / Disabled

Host-HA improves on investigation by providing a new way of investigating VMs using VM disk activity. It also adds to the fencing capabilities by integrating with the OOBM feature.

In order for VM-HA to work correctly and in sync with Host-HA, it is important that the state of a host as seen by the two is the same, as per the above table. The VM-HA model has been modified to query the Host-HA states to get the actual host state when the feature is enabled. It also makes sure VM-HA related activities are not started unless the host has been properly fenced.

About the author

Boris Stoyanov is Software Engineer in testing at ShapeBlue, The Cloud Specialists. Bobby spends his time testing features for the Apache CloudStack Community and for our ShapeBlue clients.

Introduction

Managing user roles has been a pain for a while, as the model of having a commands.properties file that defines roles and their permissions can be hard to comprehend and use. Due to this, not many CloudStack users made changes to the default hardcoded roles or enhanced them further. Therefore, ShapeBlue has taken the opportunity to rewrite the Role-Based Access Control (RBAC) unit into a Dynamic Roles model. The change allows the CloudStack Root Admin to create new roles with customised permissions from the CloudStack UI by allowing / denying specific APIs. It deprecates the old-fashioned commands.properties file and transfers all the rules into the DB. This is available in CloudStack 4.9.x and greater.

How does it work?

Dynamic RBAC introduces a new tab in the CloudStack console called Roles. Root Admins by default are able to navigate there and create / update all roles. When creating a new role, the Root Admin is able to select the rules that apply to that role, and can define a list of APIs to allow or deny for the role. When a user (assigned a specific role) issues an API request, the backend checks the requested API against the configured rules for the assigned role, and the user is only able to call the API if it is allowed by the list. If it is denied or not listed, the API call will fail.

How to use it?

In this example, let’s assume we want to create a Root Admin that has read-only rights on everything but “Global Settings” in CloudStack.

The following rules configuration shows an example of this custom role, which is only able to view resources. In the rules tab of the custom role called “read-only”, only “list*” APIs are allowed, meaning that a user with this role will not be able to delete / update anything within CloudStack, but just use the list APIs. Also, note an addition that denies any APIs related to configurations (*Configuration). Due to this, the user will not be able to see anything within “Global Settings”. The order of the configured rules is also very important – the Dynamic Roles checker iterates the list top-down, so when configuring, it is best practice to shift “Deny” rules to the top. Shifting rules is possible by simply dragging and dropping the rule. In this particular case, if the “Allow list*” rule was on top of the “Deny *Configuration” rule, the user would be able to see the Global Settings.
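
The same role can also be built through the API instead of the UI. An illustrative cloudmonkey session (the role UUID is a placeholder, and the rule patterns simply mirror the example above):

# cloudmonkey create role name=read-only type=Admin description="Read-only root admin"
# cloudmonkey create rolepermission roleid=<role-uuid> rule=*Configuration* permission=deny
# cloudmonkey create rolepermission roleid=<role-uuid> rule=list* permission=allow

Note that the deny rule is created first, so it sits above the allow rule, matching the ordering advice above.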

When the user hits an API that is denied, they will be prompted with a generic permission error message.


OK, what happens if we upgrade to 4.9?

Dynamic Roles is available and enabled by default in all new installations from the CloudStack 4.9.x release onwards. If a user upgrades from an older version (to 4.9.x or greater), Dynamic Roles will be disabled by default and CloudStack will follow the old-fashioned way of handling RBAC (i.e. with the commands.properties file). After the upgrade, existing deployments of CloudStack can be migrated to Dynamic RBAC by running a migration tool which is part of the 4.9 installation. The migration tool is located at the following path on the management server: /usr/share/cloudstack-common/scripts/util/migrate-dynamicroles.py

When run, the tool will enable Dynamic RBAC, copy all existing hardcoded roles from the commands.properties file, and create the same entities in the database following the Dynamic Roles data format. Finally, it will rename the commands.properties file to “commands.properties.deprecated” as a backup file.

Example:

/usr/share/cloudstack-common/scripts/util/migrate-dynamicroles.py -u cloud -p cloud -H localhost -P 3306 -f /etc/cloudstack/management/commands.properties

The script is written in Python 3, so python3 should be available on the management server.

Running this will output the following:
Apache CloudStack Role Permission Migration Tool

(c) Apache CloudStack Authors and the ASF, under the Apache License, Version 2.0 

Running this migration tool will remove any default-role permissions from cloud.role_permissions. Do you want to continue? [y/N]y

The commands.properties file has been deprecated and moved at: /etc/cloudstack/management/commands.properties.deprecated

Static role permissions from commands.properties have been migrated into the db

Dynamic role based API checker has been enabled! 

And you’re all set – there is no need to restart the management servers! There’s a new global setting introduced with this feature called ‘dynamic.apichecker.enabled’; if it is set to “true”, Dynamic Roles is enabled. If by any chance there is a failure during migration, the tool will roll back the procedure and revert to the old hardcoded way of handling RBAC.

After the upgrade, the Root Admin role is left with a single allow-everything (“*”) rule, meaning all APIs are allowed.

Other roles have each individual API rule explicitly added (where available), as can be seen in the Domain Admin role’s rule list.

Here’s a link to the official CloudStack Admin documentation

About the author

Boris Stoyanov is Software Engineer in testing at ShapeBlue, The Cloud Specialists. Bobby spends his time testing features for the Apache CloudStack Community and for our ShapeBlue clients.