Apache CloudStack 4.21 introduces Async Agent Command Reconciliation, a mechanism designed to improve the reliability and accuracy of long-running operations (such as Instance and Volume migrations) when interruptions affect the Management Server, the Agent, or the network. The Feature tracks and reconciles three key Commands (CopyCommand, MigrateCommand, and MigrateVolumeCommand) by using Agent heartbeats and a new reconciliation workflow. This keeps resource states consistent across CloudStack and KVM Hosts, even after crashes or restarts.
Feature Overview
Alongside the CloudStack Management Server, CloudStack provides a service called CloudStack Agent, which is installed on each KVM Host (for example RHEL, RockyLinux, AlmaLinux, Ubuntu and Debian Hosts). When CloudStack creates, updates or deletes resources on a KVM Host, the CloudStack Management Server sends an internal command to the CloudStack Agent on that Host. The Agent processes the command and returns an answer to the Management Server, which then processes the answer upon receipt. This is the standard workflow.
Problem description
The standard process relies heavily on stable communication and running services. When the communication link is unstable or a service is not running, several issues can arise, especially for asynchronous jobs that take a long time. Apache CloudStack 4.21 introduces the new feature “Async Agent Command Reconciliation” to increase the stability of this process and, subsequently, the accuracy of resource states.
The following scenarios lead to inconsistencies:
- Intermittent connection failure, which prevents the CloudStack Management Server from receiving the CloudStack Agent’s answer.
- The Agent crashes or is restarted (e.g., by third parties) while some jobs are still being processed in the backend.
- The CloudStack Management Server is restarted while some jobs are being processed on the Agent side.
As a consequence of these failures, the CloudStack Management Server cannot process the answer or update the state of the resources. This leads to inconsistent resource states across the CloudStack Management Server, the Agent, and network and storage components.
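To make the failure mode concrete, the following minimal Python sketch models the standard command/answer exchange and a lost answer. All names here (Agent, ManagementServer, answer_delivered) and the state strings are hypothetical and chosen only for illustration; this is not CloudStack code.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical stand-in for a CloudStack Agent on a KVM Host."""

    def execute(self, command: str) -> str:
        # Pretend the long-running work succeeds on the Host.
        return f"{command}: success"


@dataclass
class ManagementServer:
    """Hypothetical stand-in for the Management Server's view of resources."""
    volume_state: dict = field(default_factory=dict)

    def migrate_volume(self, agent: Agent, volume: str, answer_delivered: bool) -> None:
        self.volume_state[volume] = "Migrating"  # state set before sending
        agent.execute("MigrateVolumeCommand")    # work finishes on the Host
        if answer_delivered:                     # the answer reached the server
            self.volume_state[volume] = "Ready"
        # Otherwise the answer is lost and the state is never updated.


ms = ManagementServer()
ms.migrate_volume(Agent(), "vol-1", answer_delivered=True)
ms.migrate_volume(Agent(), "vol-2", answer_delivered=False)
print(ms.volume_state)  # {'vol-1': 'Ready', 'vol-2': 'Migrating'}
```

In the second call the work completed on the Host, yet the recorded state stays at its in-progress value, which is the kind of inconsistency described above.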
Introducing Terminology: Reconcile Command
Virtual machine Instances and Volumes are critical resources in cloud environments. Since the migration of Instances and Volumes may take a long time, maintaining accurate information is essential when the previously described events occur.
In Apache CloudStack 4.21, a new term, “Reconcile Command”, is introduced to improve the accuracy of Instance and Volume information. A Reconcile Command is an internal command whose outcome can be reconciled afterwards if an error occurs while it is being processed.
Currently, there are three Reconcile Commands:
Command | Description |
CopyCommand | This command is used when copying a Template or Volume between Primary Storage and Secondary Storage. It is also used in some scenarios to migrate Volumes from Primary Storage to another Primary Storage. |
MigrateCommand | This command is used when migrating a running Instance to another Host, with or without migrating its Volumes to another storage pool. |
MigrateVolumeCommand | This command is used when migrating a Volume between storage pools in some scenarios. |
Supported Operations and Storage Pools
The Feature has been tested for online Instance Migration and online/offline Volume Migration on the following storage pools:
- Local to NFS storage
- NFS to Local storage
- Local to Local storage
- NFS to NFS storage
- PowerFlex to PowerFlex storage
Global settings
The feature is disabled by default. Several Global Settings are available to control the Feature:
Name of Global Configuration | Default Value | Description |
reconcile.commands.enabled | false | Indicates whether the background task to reconcile the commands is enabled or not. |
reconcile.commands.interval | 60 | Interval (in seconds) for the background task to reconcile the commands. |
reconcile.commands.max.attempts | 30 | The maximum number of attempts to reconcile the commands. |
reconcile.commands.workers | 100 | The number of worker threads used to reconcile the commands. |
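As a sketch of how these settings might be changed, the snippet below builds request parameters for CloudStack's updateConfiguration API call, one set per setting. It only prepares the parameters and does not sign or send any request; the helper name build_update_params and the chosen values are illustrative assumptions.

```python
def build_update_params(settings: dict) -> list:
    """Return one updateConfiguration parameter dict per setting (hypothetical helper)."""
    return [
        {"command": "updateConfiguration", "name": name, "value": str(value).lower()}
        for name, value in settings.items()
    ]


desired = {
    "reconcile.commands.enabled": True,  # the default is false
    "reconcile.commands.interval": 60,   # seconds, same as the default
}
for params in build_update_params(desired):
    print(params)
```

The same values can also be edited under Global Settings in the CloudStack UI.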
Solution/New Workflow
For Non-Reconcile Commands, the process remains unchanged. For Reconcile Commands, the process is integrated with the Agent heartbeat, which runs periodically at the interval set by the Global Setting ping.interval.
The main differences in the workflow are:
- The Management Server creates a record in the reconcile_commands table before sending the Reconcile Command to the Agent.
- The Agent updates the Command/answer in a JSON file while processing the command.
- The Agent syncs with the Management Server by sending a heartbeat (PingCommand) at the interval set by ping.interval (every 60 seconds in this example).
- When the Management Server receives the heartbeat (PingCommand) from the Agent, it updates the information of Reconcile Commands in the database and sends the Commands in PingAnswer to the Agent.
- When the Agent receives the PingAnswer from the Management Server, it removes the JSON file if the state is COMPLETED or FAILED.
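The steps above can be sketched as a small Python simulation. This is an illustration under stated assumptions, not the actual implementation: the function names, the JSON layout, and the STARTED and PROCESSING states are hypothetical, and a dictionary stands in for the reconcile_commands table. Only the COMPLETED and FAILED states come from the description above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical in-memory stand-in for the reconcile_commands table.
reconcile_commands = {}


def management_send(command_id: str, name: str) -> None:
    # Step 1: record the command before it is sent to the Agent.
    reconcile_commands[command_id] = {"name": name, "state": "STARTED"}


def agent_track(workdir: Path, command_id: str, state: str) -> None:
    # Step 2: the Agent persists the command state in a JSON file.
    path = workdir / f"{command_id}.json"
    path.write_text(json.dumps({"id": command_id, "state": state}))


def agent_ping(workdir: Path) -> list:
    # Step 3: the heartbeat carries the state of every tracked command.
    return [json.loads(p.read_text()) for p in sorted(workdir.glob("*.json"))]


def management_on_ping(reports: list) -> list:
    # Step 4: update the table and echo the commands back in the answer.
    for report in reports:
        reconcile_commands[report["id"]]["state"] = report["state"]
    return reports


def agent_on_ping_answer(workdir: Path, commands: list) -> None:
    # Step 5: finished commands no longer need their JSON file.
    for command in commands:
        if command["state"] in ("COMPLETED", "FAILED"):
            (workdir / f"{command['id']}.json").unlink()


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    management_send("cmd-1", "MigrateVolumeCommand")
    agent_track(workdir, "cmd-1", "PROCESSING")
    agent_track(workdir, "cmd-1", "COMPLETED")
    agent_on_ping_answer(workdir, management_on_ping(agent_ping(workdir)))
    remaining = list(workdir.glob("*.json"))

print(reconcile_commands["cmd-1"]["state"], len(remaining))  # COMPLETED 0
```

Because the state lives in a file on the Host as well as in the database, either side can recover the outcome of a command after the other side has been interrupted.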
If the Agent crashes or is restarted, or the Management Server is restarted:
- The Management Server reconciles the Commands by sending an internal ReconcileCommand to the Agents. This checks the state of the resources on the KVM Hosts or on storage.
- Once the state is determined, the Management Server updates the information of the Instance or Volume.
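A minimal sketch of this recovery step is shown below, assuming the Management Server retries a state check up to the limit set by reconcile.commands.max.attempts. The reconcile function, the record layout, and the probe_host callback (standing in for sending a ReconcileCommand) are hypothetical, the waiting time between attempts is omitted, and what happens once the limit is reached is not covered here.

```python
MAX_ATTEMPTS = 30  # mirrors the default of reconcile.commands.max.attempts


def reconcile(record: dict, probe_host) -> str:
    """Retry a hypothetical Host probe until it yields a final state."""
    while record["attempts"] < MAX_ATTEMPTS:
        record["attempts"] += 1
        state = probe_host(record["id"])  # stands in for ReconcileCommand
        if state in ("COMPLETED", "FAILED"):
            record["state"] = state       # the resource state can now be updated
            return state
    return record["state"]                # still undetermined after all attempts


# Fake Host: the migration is reported as finished on the third probe.
replies = iter(["PROCESSING", "PROCESSING", "COMPLETED"])
record = {"id": "cmd-1", "state": "PROCESSING", "attempts": 0}
result = reconcile(record, lambda _id: next(replies))
print(result, record["attempts"])  # COMPLETED 3
```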
Conclusion
With the new “Async Agent Command Reconciliation” Feature in Apache CloudStack 4.21, CloudStack Users receive more accurate information about Instances and Volumes when unexpected events affect the CloudStack Management Server, the CloudStack Agent, or the connection between them.
References
https://github.com/apache/cloudstack/pull/10514
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Async+Agent+Command+Reconciliation

Wei Zhou works as a Software Architect at ShapeBlue. He has many years of experience in cloud computing and a passion for cutting-edge technology. Wei works on software design and implementation, as well as resolving issues for customers and community users. Wei has been a committer of the Apache CloudStack project since 2013 and a PMC member since 2017.