Hosts stuck in PrepareForMaintenance state | CloudStack Feature First Look
Putting a host into maintenance in CloudStack starts with an administrator requesting ‘prepare for maintenance’ on the host, either via the API or through the UI. When CloudStack receives the request, it sets the host state to ‘PrepareForMaintenance’ and starts migrating away every VM running on the host. Ideally, the process lasts until no VMs are left running on the host and it can safely enter Maintenance mode.
However, if these VM migrations fail, the host can stay in the ‘PrepareForMaintenance’ state indefinitely. The state alone tells administrators little, as it could mean either that CloudStack is still trying to migrate VMs away or that the process has simply failed. In the latter case, the administrator needs to cancel maintenance, fix any problems and prepare the host for maintenance again.
This feature tackles the problem of hosts stuck indefinitely in this state by giving administrators more control when preparing a host for maintenance, with the following changes:
- The maximum number of attempts to migrate VMs away from a host preparing to enter maintenance mode is set by the global setting ‘vm.ha.migration.max.retries’.
- If there are errors during the VM migrations, the host is put into a new state, ‘ErrorInPrepareForMaintenance’. While the host stays in this state, admins can correct the errors, and the host state is updated on the next iteration of checks by the management server.
- If the maximum number of attempts is reached for every VM on a host preparing to enter maintenance mode, and the migrations still could not be completed, the host is put into the ‘ErrorInMaintenance’ state.
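As a rough illustration of the API route, the sketch below builds signed request URLs for the `updateConfiguration` and `prepareHostForMaintenance` commands. The endpoint, the keys, the host ID and the retry value of 5 are placeholders, and the signing helper follows the usual CloudStack scheme (sorted, lowercased query string signed with HMAC-SHA1) as an assumption to verify against your own deployment. It only builds the URLs and sends nothing.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

# Hypothetical values: replace with a real endpoint and an admin's API keys.
ENDPOINT = "http://mgmt.example.com:8080/client/api"
API_KEY = "example-api-key"
SECRET_KEY = "example-secret-key"


def sign(params, secret_key):
    """Return the base64 HMAC-SHA1 signature of the sorted, lowercased query."""
    pairs = sorted(params.items(), key=lambda kv: kv[0].lower())
    query = "&".join(f"{k}={quote(str(v), safe='')}" for k, v in pairs)
    digest = hmac.new(secret_key.encode(), query.lower().encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def build_url(command, **args):
    """Build a signed GET URL for one API command (no request is sent)."""
    params = {"command": command, "response": "json", "apikey": API_KEY, **args}
    signature = sign(params, SECRET_KEY)
    query = "&".join(f"{k}={quote(str(v), safe='')}" for k, v in params.items())
    return f"{ENDPOINT}?{query}&signature={quote(signature, safe='')}"


# Cap the migration attempts, then request maintenance for a placeholder host ID.
set_retries_url = build_url(
    "updateConfiguration", name="vm.ha.migration.max.retries", value=5
)
prepare_url = build_url("prepareHostForMaintenance", id="HOST-UUID-PLACEHOLDER")
print(set_retries_url)
print(prepare_url)
```

The same helper could build a `cancelHostMaintenance` request when an administrator needs to back out and retry.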
The new behavior when preparing a host for maintenance is therefore as follows:
- To enter maintenance mode, every VM must have been migrated away from a host
- When a host is preparing to enter maintenance mode, the following conditions must be met:
  - Once the number of attempts to migrate a VM on a ‘PrepareForMaintenance’ host reaches its limit, no further migration attempts are scheduled for that VM.
  - If the migration attempts, including all subsequent retries, have failed for any VM on a ‘PrepareForMaintenance’ host, the host must transition to the ‘ErrorInMaintenance’ state.
  - A host must transition to ‘ErrorInMaintenance’ only when it is preparing to enter maintenance mode and one or more VMs could not be migrated away after the migration attempts for each of those VMs have been used up.
- Running VMs on a host preparing to enter maintenance must not be stopped if any migration attempt fails
- As before, a host must still be able to enter maintenance mode when no migrations fail
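The rules above can be sketched as a small state model. This is a simplified illustration, not CloudStack's implementation: the `Vm`, `Host` and `check_host` names are hypothetical, each call stands for one iteration of the management server's checks, and the retry limit is treated as a total number of attempts per VM.

```python
from dataclasses import dataclass
from enum import Enum


class HostState(Enum):
    PREPARE = "PrepareForMaintenance"
    ERROR_IN_PREPARE = "ErrorInPrepareForMaintenance"
    ERROR_IN_MAINTENANCE = "ErrorInMaintenance"
    MAINTENANCE = "Maintenance"


@dataclass
class Vm:
    name: str
    migratable: bool      # hypothetical flag: would a migration attempt succeed?
    attempts: int = 0
    running: bool = True  # never cleared: failed migrations do not stop the VM


@dataclass
class Host:
    vms: list
    state: HostState = HostState.PREPARE


def check_host(host, max_retries):
    """Run one check iteration and return the resulting host state."""
    for vm in list(host.vms):
        if vm.attempts >= max_retries:
            continue  # limit reached: no further attempt is scheduled
        vm.attempts += 1
        if vm.migratable:
            host.vms.remove(vm)  # migrated away from this host
    if not host.vms:
        host.state = HostState.MAINTENANCE
    elif all(vm.attempts >= max_retries for vm in host.vms):
        host.state = HostState.ERROR_IN_MAINTENANCE
    else:
        host.state = HostState.ERROR_IN_PREPARE  # errors seen, retries remain
    return host.state


# One VM migrates cleanly, the other fails on every attempt.
host = Host(vms=[Vm("vm-a", True), Vm("vm-b", False)])
history = [check_host(host, max_retries=3).value for _ in range(4)]
print(history)
```

In this model, setting `migratable` to `True` on the failing VM between two checks stands in for an administrator correcting the error, and the next call moves the host to ‘Maintenance’.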