Working towards CloudStack zero downtime upgrades


As most people know, Apache CloudStack has gained a reputation as a solid, low-maintenance, dependable cloud orchestration platform. That’s why so many leaders and challengers in last year’s Gartner Magic Quadrant were organisations underpinning their services with Apache CloudStack. However, version upgrades, whilst much simpler than with many competing technologies, have always been the pain point for CloudStack operators. The irony is that upgrading CloudStack itself is usually relatively painless, but upgrading its distributed Virtual Routers often causes several minutes of network downtime for users, which in turn requires maintenance windows.

At ShapeBlue we have a vision that CloudStack based clouds – whatever their size and complexity – should be able to be upgraded with zero downtime. No maintenance windows, no service interruptions: zero downtime. Achieving this will allow all CloudStack users/operators to benefit from the vast array of new functionality being added by the CloudStack community in every release.

We set out on the journey towards zero downtime a number of months ago and have been working hard with the CloudStack community on the first steps (it is important to note that “we” includes many people in the CloudStack community who have contributed to this work). Below, I set out the detail of what we’ve achieved so far and what we hope to achieve in the future, but if readers just want the headline: CloudStack 4.11.1 reduces network downtime during upgrades by up to 80% or more compared to CloudStack 4.9.3, and downtime is nearly eliminated when using redundant VRs.

What’s the problem when upgrading?

During upgrades, CloudStack’s virtual routers (VRs) have to be restarted, and usually destroyed and recreated (this sometimes also has to be done during day-to-day operations, but is most apparent during upgrades). These restarts usually lead to downtime for users, in some cases up to several minutes. Whilst Redundant Virtual Routers can mitigate this, they have some limitations with regard to backward compatibility and are therefore not always a solution to the problem.

Downtime reductions in CloudStack 4.11.1

With the changes made in 4.11.1 (described below) we have managed to achieve significant reductions in network downtime during VR restarts. Please note these improvements will vary from one environment to another, and will be dependent on hypervisor type, hypervisor version, storage backend as well as network bandwidth, so we suggest testing in your own environment to determine the benefits. We’ve tested with a typical VR configuration in our virtualised lab environment.

The testing setup used is as follows:

  • CloudStack 4.9.3 and 4.11.1 environments built in parallel. To maintain the same hypervisor versions across both tests the following hypervisor versions were used:
    • VMware vSphere 5.5
    • KVM on CentOS7
    • XenServer 7.0
  • Environment configuration: In each test we build a simple isolated network with:
    • 10 VMs
    • 10 IP addresses
    • Firewall rules configured on all IP addresses

Downtime was measured as follows:

  • For egress traffic we measured the total amount of time an outbound ping would fail from a hosted VM during the restart process.
  • For ingress traffic we assumed a hosted service on a CloudStack VM and measured the amount of time SSH was unavailable during the restart process.
  • In all tests we carried out a “restart network with cleanup” operation and measured the above times. Note: with the new parallel VR restart process (see below) we no longer care how long the overall process takes; we are only interested in how long the network is impacted. As a result we’ve simply measured the total time services were unavailable (this may in some cases be the sum of multiple downtime periods).
  • Tests were repeated multiple times and the average number of seconds of ingress and egress downtime was calculated across tests for each hypervisor. To illustrate our best case scenarios we’ve also included the shortest measured downtime figure.
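
To make the “sum of multiple downtime periods” concrete, here is a minimal Python sketch of how such a figure could be derived from a series of timestamped probe results (ping or SSH). The sample format and the function names are assumptions made for illustration; this is not the tooling used for the tests above.

```python
from typing import Iterable, List, Tuple

# One probe result: (timestamp in seconds, did the probe succeed?)
Sample = Tuple[float, bool]


def downtime_periods(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each run of failed probes.

    A period starts at the first failed probe and ends at the next
    successful one. A trailing run of failures with no recovery is
    ignored, because its end time is unknown.
    """
    periods = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            periods.append((start, ts))   # outage ends
            start = None
    return periods


def total_downtime(samples: Iterable[Sample]) -> float:
    """Sum the length of all downtime periods, in seconds."""
    return sum(end - start for start, end in downtime_periods(samples))


# Hypothetical probe log: one probe per second, two separate outages.
samples = [(0, True), (1, False), (2, False), (3, False), (4, True),
           (5, True), (6, False), (7, False), (8, True)]
print(downtime_periods(samples))
print(total_downtime(samples))
```

With one probe per second the result is only accurate to about a second either way, so a shorter probe interval gives a tighter figure.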

Results are as follows:

Environment      ACS 4.9.3 avg   ACS 4.11.1 avg (lowest)   Reduction avg (highest)
VMware 5.5       119s            21s (12s)                 82% (90%)
KVM / CentOS7    44s             26s (9s)                  40% (80%)
XenServer 7.0    181s            33s (15s)                 82% (92%)
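
For readers who want to check the arithmetic, the reduction figures can be recomputed from the downtime values in the table. The sketch below assumes the reduction is calculated against the 4.9.3 average in both the “avg” and “highest” cases; that baseline is my assumption, and the function name is purely illustrative.

```python
def reduction_pct(before_s: float, after_s: float) -> float:
    """Percentage reduction in downtime going from before_s to after_s."""
    return (before_s - after_s) / before_s * 100.0


# (4.9.3 average, 4.11.1 average, 4.11.1 lowest), in seconds, from the table.
results = {
    "VMware 5.5": (119, 21, 12),
    "KVM / CentOS7": (44, 26, 9),
    "XenServer 7.0": (181, 33, 15),
}

for env, (old_avg, new_avg, new_best) in results.items():
    avg = reduction_pct(old_avg, new_avg)
    best = reduction_pct(old_avg, new_best)
    print(f"{env}: average reduction {avg:.1f}%, best case {best:.1f}%")
```

Recomputed this way, the values agree with the published percentages to within about one percentage point; small differences are to be expected because the table shows rounded averages.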

 

How these results were achieved

Existing improvements made in CloudStack 4.11

A number of changes were made in CloudStack 4.11 designed to improve VR restart performance:

  • The system VM has been upgraded from Debian 7 (init based) to Debian 9 (systemd based)
  • The patching process and boot times have been improved, and we have also eliminated reboots after patching
  • The system VM disk size has been reduced, leading to faster deployment time
  • The VPN backend in the VR has been upgraded to strongSwan, which provides improved VPN performance
  • The redundant VR (RVR) mechanisms have been improved
  • The code base has been refactored, and it is now easier to build and maintain VR code
  • A number of stability improvements have been made

Changes in CloudStack 4.11.1 – Parallel VR restarts

CloudStack 4.11.1 will ship with a new feature, Parallel VR Restarts, which changes the behaviour of the “restart network with cleanup” option. In previous CloudStack versions this was a serial action: the original VR was stopped and destroyed, and only then was a new VR started. In CloudStack 4.11.1 this has been changed to a parallel process where a “restart with cleanup” means:

  • A new VR is started in the background while the old one is still running and providing networking services.
  • Once the new VR is up and has checked in to CloudStack management the old VR is simply stopped and destroyed.
  • This is followed by a last configuration step where ARP caches at neighbours are updated.

With this method there is no negotiation between the old and new VR; CloudStack simply orchestrates the parallel startup of the new VR. As a result this method has no prerequisites around the version of the original VR, meaning it can be used for VR restarts after an upgrade from considerably older CloudStack versions to 4.11.1.

It is worth noting that this 4.11.1 feature does not greatly reduce the actual VR processing time itself. However, with the parallel startup this processing time no longer affects network downtime; the downtime is instead determined mainly by the final handover of network processing from the old to the new VR.
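
The difference between the serial and parallel approaches can be illustrated with a toy timing model. The Python sketch below is not CloudStack code: the step descriptions follow the text above, while the function names and the timings are hypothetical and chosen only to show why VR startup time drops out of the downtime figure.

```python
from typing import List, Optional, Tuple

# One step: (description, duration in seconds, VR serving traffic meanwhile).
# A serving value of None means no VR is handling traffic during that step.
Step = Tuple[str, float, Optional[str]]


def serial_restart(boot_s: float, handover_s: float) -> List[Step]:
    """Older behaviour: destroy the old VR first, then build the new one."""
    return [
        ("stop and destroy old VR", handover_s, None),
        ("start and configure new VR", boot_s, None),
        ("new VR serving", 0.0, "new"),
    ]


def parallel_restart(boot_s: float, handover_s: float) -> List[Step]:
    """4.11.1 behaviour: build the new VR while the old one keeps serving."""
    return [
        ("start and configure new VR in background", boot_s, "old"),
        ("stop and destroy old VR, update ARP caches", handover_s, None),
        ("new VR serving", 0.0, "new"),
    ]


def downtime(steps: List[Step]) -> float:
    """Sum the durations of all steps during which no VR is serving."""
    return sum(duration for _, duration, serving in steps if serving is None)


# Hypothetical timings, for illustration only.
BOOT_S, HANDOVER_S = 90.0, 20.0
print("serial downtime:  ", downtime(serial_restart(BOOT_S, HANDOVER_S)))
print("parallel downtime:", downtime(parallel_restart(BOOT_S, HANDOVER_S)))
```

In this model, doubling the boot time doubles the length of the overall parallel restart but leaves its downtime unchanged, which is why the measurements only consider the time services were actually unavailable.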

In addition to the considerable reduction in normal VR restart downtime, this feature also introduces a much improved redundant VR restart. This comes close to eliminating network downtime when redundant VR networks are restarted, but it does mean the old and new VRs need to be version compatible. In our own testing we have seen downtime for redundant VR networks nearly eliminated.

Coming in future versions

Advanced parallel restarts

The next step on the journey is to add further handshaking between old and new VR:

  • New VR will be started in parallel to old, but with some network services and / or network interfaces disabled.
  • Once new VR is up CloudStack management will do an external handover from old VR to new, i.e. handle VR connectivity via the hypervisor.

Fully negotiated redundant VR restarts

The last step on the journey will be aiming towards a fully redundant handover from old to new VR:

  • In this final step the end goal is to make all VRs redundant capable, which will reduce same version restart times as well as future upgrade restart times.
  • New VR will again be started in parallel to old, but will be configured with the redundancy options currently used in the RVR.
  • Once new VR is up the old and new VRs will internally negotiate the handover of all networking connectivity and services, before the old VR is shut down.
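
One plausible shape for such a negotiated handover is a VRRP-style priority election, where the router with the highest priority takes over the master role. The sketch below is a heavily simplified illustration and not a description of how CloudStack will implement this: the `Router` class, the priority values and the election function are all hypothetical, and real VRRP also deals with advertisement timers, tie-breaking and preemption settings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    """Hypothetical stand-in for a VR taking part in the election."""
    name: str
    priority: int
    alive: bool = True


def elect_master(routers: List[Router]) -> Optional[Router]:
    """Pick the live router with the highest priority, if there is one."""
    live = [r for r in routers if r.alive]
    return max(live, key=lambda r: r.priority) if live else None


# The old VR is the only router, so it is master.
old_vr = Router("old-vr", priority=100)
group = [old_vr]
master_before = elect_master(group)

# The new VR joins with a higher (hypothetical) priority and takes over
# while the old VR is still running.
new_vr = Router("new-vr", priority=110)
group.append(new_vr)
master_after_join = elect_master(group)

# Only now is the old VR shut down; the master role does not move again.
old_vr.alive = False
master_after_shutdown = elect_master(group)

print(master_before.name, master_after_join.name, master_after_shutdown.name)
```

The point of the sketch is the ordering: the new VR becomes master before the old one is stopped, so there is never a step in which no router holds the role.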

– – –

During this journey there are a number of tasks that need to be carried out, both to make the VR internal processing more efficient and to improve the backend network restart mechanisms:

  • General speedup of iptables rules application
  • Fix and improvement of the DNS / DHCP configuration to eliminate repetition of processing steps and cut down on processing time
  • Further improvements of redundant Virtual Router: VRRP2 configuration, and/or move to a different VR HA solution
  • A move to make all VRs redundant capable by default
  • Move from Python 2 to Python 3
  • Consider a move from iptables to nftables
  • Converge and make network topologies flexible, refactor and merge VPC and non-VPC code base

Conclusion

With the changes implemented in 4.11.1 we have already made a huge step forward in reducing network downtime as part of VR restarts, whether during day-to-day operation or as part of a CloudStack version upgrade. With downtime reduced by up to 80% and average figures of roughly 20 to 35 seconds in our tests, this is a considerable improvement, and it is only the first step on the journey.

We have not yet achieved our goal of “zero downtime upgrades” but it is worth considering that the network interruptions that CloudStack can now achieve during an upgrade will be less than the timeouts for many applications and protocols.

In the coming CloudStack versions we hope to continue this development and further reduce the figures, working towards the ultimate goal of “zero downtime upgrades”. 

About The Author

Dag Sonstebo is a Cloud Architect at ShapeBlue, The Cloud Specialists. Dag spends his time designing, implementing and automating IaaS solutions based around Apache CloudStack.