Scaling Apache CloudStack to 100,000 Hosts and Millions of Instances

Abstract

Apache CloudStack, a proven open‑source IaaS platform, has demonstrated strong scalability in production deployments with thousands of hypervisors. However, reaching hyperscale targets—100,000 hosts and millions of instances—revealed architectural and resource bottlenecks. This whitepaper analyses those challenges and describes enhancements such as database optimisations, improved agent communications, caching, and concurrency handling. These improvements, validated through rigorous benchmarking, were introduced in the 4.20 series.

Introduction

Motivation: Cloud providers and large enterprises require cloud platforms that scale seamlessly beyond current upper limits.

Objective: Extend CloudStack’s capabilities to support 100,000 hosts and millions of Instances while maintaining performance and reliability.

Approach: Combine simulator-based and agent-based testing, profiling, and iterative optimisations targeting critical subsystems.

Baseline & Challenges

Existing Scale: Production deployments routinely handle 5,000 hosts. To test hyperscale boundaries, a simulator-based environment using KVM Agents was built.

Identified Bottlenecks:

  • Centralised Database: High query load exposed missing indexes and redundant operations.
  • API & Concurrency: Heavy automation and parallel workflows stressed API endpoints.
  • Agent Communication: TLS handshake bottlenecks and slow restart behaviour.
  • Lack of Fault Tolerance: Failures frequently propagated across subsystems.
  • Storage Connectivity: Sequential worker processing limited throughput.
  • Resource Scheduling: Large Clusters degraded deployment efficiency.

 

Optimisation Strategies

Database Layer

  • Switched to HikariCP

CloudStack 4.20.0 introduced configurable database connection pooling via the db.<DATABASE>.connectionPoolLib property in db.properties. In hyperscale tests, switching from DBCP2 to HikariCP reduced blocked threads and improved memory scalability. A benchmark with 50,000 simulator Hosts and 10,000 Instances demonstrated clear advantages for HikariCP.
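A minimal db.properties sketch of this setting is shown below. The values are illustrative assumptions; the exact accepted pool-library value should be confirmed against the 4.20 documentation.

  # db.properties (illustrative sketch – values are assumptions, not verified defaults)
  # Select the connection pool library per database; DBCP2 remains the default.
  db.cloud.connectionPoolLib=hikaricp
  db.usage.connectionPoolLib=hikaricp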

Figure 1 – With DBCP

Figure 2 – With HikariCP

  • Optimisation for Connection Ping

For JDBC4-compliant MySQL connectors, connection validation now uses the Connection.isValid() API instead of issuing extra SELECT 1 queries, reducing latency and improving throughput.
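A minimal Java sketch contrasting the two validation approaches is shown below; the JDBC URL and credentials are placeholders, not CloudStack's actual configuration.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class ConnectionValidationSketch {
      public static void main(String[] args) throws Exception {
          Connection conn = DriverManager.getConnection(
                  "jdbc:mysql://localhost:3306/cloud", "cloud", "password");

          // JDBC4 validation: handled natively by the driver, no extra query round trip.
          boolean healthy = conn.isValid(2); // timeout in seconds

          // Legacy-style validation: sends an additional query for every check.
          try (Statement stmt = conn.createStatement()) {
              stmt.execute("SELECT 1");
          }

          System.out.println("Connection valid: " + healthy);
          conn.close();
      }
  }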

 

  • Database Retrieval Optimisations

Optimisations included:

  • Avoiding full-table fetches when only counts are needed.
  • Selecting only the required columns instead of entire rows.
  • Reducing repetitive retrieval of identical data.

These changes lowered database load and improved API responsiveness; a brief sketch of the first two patterns follows.
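The sketch below illustrates the general idea in Java/JDBC; the table and column names are hypothetical and do not necessarily match CloudStack's schema or DAO layer.

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLException;

  public class RetrievalSketch {
      // Let the database count rows instead of fetching every row into memory.
      long countRunningVms(Connection conn) throws SQLException {
          try (PreparedStatement ps = conn.prepareStatement(
                  "SELECT COUNT(*) FROM vm_instance WHERE state = ?")) {
              ps.setString(1, "Running");
              try (ResultSet rs = ps.executeQuery()) {
                  rs.next();
                  return rs.getLong(1);
              }
          }
      }

      // Select only the required column instead of the entire row.
      String findVmName(Connection conn, long vmId) throws SQLException {
          try (PreparedStatement ps = conn.prepareStatement(
                  "SELECT name FROM vm_instance WHERE id = ?")) {
              ps.setLong(1, vmId);
              try (ResultSet rs = ps.executeQuery()) {
                  return rs.next() ? rs.getString(1) : null;
              }
          }
      }
  }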

Agent–Server Communication

  • TLS Concurrency Enhancements

The handshake and connection acceptance flow between Agents and the Management Server were optimised to handle significantly more concurrent connections. This improved reconnection speed and throughput, especially when thousands of Agents reconnected after a Management Server restart.
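The general pattern is sketched below, assuming a dedicated handshake pool; this illustrates the technique rather than CloudStack's actual implementation, and the port and pool size are assumptions.

  import javax.net.ssl.SSLServerSocket;
  import javax.net.ssl.SSLServerSocketFactory;
  import javax.net.ssl.SSLSocket;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  public class HandshakeOffloadSketch {
      public static void main(String[] args) throws Exception {
          ExecutorService handshakePool = Executors.newFixedThreadPool(32); // assumed size
          SSLServerSocket server = (SSLServerSocket)
                  SSLServerSocketFactory.getDefault().createServerSocket(8250); // assumed port
          while (true) {
              SSLSocket client = (SSLSocket) server.accept();
              handshakePool.submit(() -> {
                  try {
                      // The expensive TLS handshake runs off the accept thread, so one
                      // slow client cannot stall thousands of reconnecting Agents.
                      client.startHandshake();
                      // Hand the established connection to the agent-handling logic here.
                  } catch (Exception e) {
                      try { client.close(); } catch (Exception ignored) { }
                  }
              });
          }
      }
  }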

  • Mock Agent Plugin for Testing

A mock Agent plugin was developed to simulate large-scale environments. This plugin mimics a KVM-like Agent but executes hypervisor requests as NO-OPs. It enabled controlled, repeatable testing of reconnections, load-balancing, and high-throughput scenarios without requiring thousands of physical Hosts.
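Conceptually, the plugin acknowledges every hypervisor command without doing any work, as in the hypothetical sketch below (class and method names are illustrative, not the plugin's real API).

  public class MockAgentSketch {
      interface Command { String name(); }
      record Answer(boolean success, String details) { }

      // Every hypervisor request is answered as a NO-OP, so thousands of such agents
      // can run on a handful of machines while still exercising the Management
      // Server's connection, scheduling, and messaging paths.
      Answer execute(Command cmd) {
          return new Answer(true, "NO-OP for " + cmd.name());
      }
  }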

Figure 3 – Significant improvements in Agent reconnection time before and after the changes. In a configuration with two Management Servers, 10,000 mock Agents were first connected to the first Management Server and then made to reconnect to the second all at once.

Caching Improvements

  • Caffeine Cache Adoption

CloudStack now uses the Caffeine caching library, providing higher hit rates and lower latency.

  • Configuration Key Caching

Frequently accessed configuration values are now cached, reducing database lookups by ~75%. This improved API response times and reduced contention under heavy load.

Figure 4 – Before changes

Figure 5 – After changes
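A minimal sketch of configuration-value caching with Caffeine is shown below; the cache size, expiry, and loader are illustrative assumptions rather than CloudStack's actual settings.

  import com.github.benmanes.caffeine.cache.Cache;
  import com.github.benmanes.caffeine.cache.Caffeine;
  import java.time.Duration;

  public class ConfigCacheSketch {
      private final Cache<String, String> configCache = Caffeine.newBuilder()
              .maximumSize(10_000)                       // assumed capacity
              .expireAfterWrite(Duration.ofSeconds(30))  // assumed expiry
              .build();

      String getConfigValue(String key) {
          // The database is queried only when the value is absent or expired.
          return configCache.get(key, this::loadFromDatabase);
      }

      private String loadFromDatabase(String key) {
          // Placeholder for the real configuration lookup.
          return "value-for-" + key;
      }
  }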

Concurrent Workers

Background tasks such as storage connections and capacity calculations now use concurrent worker threads. Running tasks in parallel significantly improved throughput in environments with tens of thousands of Hosts.

Figure 6 – Storage connection performance before and after the change with 50,000 Hosts. After the change, the worker count was set to 2.
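A minimal sketch of the pattern is shown below, assuming a configurable worker count; the class and method names are illustrative rather than CloudStack's actual code.

  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  public class ConcurrentWorkerSketch {
      void connectHostsToPool(List<Long> hostIds, int workerCount) throws InterruptedException {
          ExecutorService workers = Executors.newFixedThreadPool(workerCount);
          for (Long hostId : hostIds) {
              // Each Host is connected independently instead of sequentially.
              workers.submit(() -> connectHost(hostId));
          }
          workers.shutdown();
          workers.awaitTermination(1, TimeUnit.HOURS);
      }

      private void connectHost(Long hostId) {
          // Placeholder for the per-host storage connection logic.
      }
  }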

JVM & System Tuning

  • Adopted G1GC: Reduced garbage collection pause times in large-scale environments.
  • Heap and OS Tuning: JVM heap sizes and OS parameters were tuned for stability and scalability under heavy load. An illustrative set of flags is sketched below.
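The flags below are a hedged example of this kind of tuning; the heap size and pause-time target are assumptions that depend on the Management Server's available memory, not recommended values from the benchmark.

  # Illustrative JVM options for the Management Server (values are assumptions)
  JAVA_OPTS="-Xms8g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"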

Results

  • Faster Agent connection and reconnection.
  • Reduced deployment time for Instances in large-scale concurrent scenarios.

Figure 7 – Instance deployment time improvement in a large-scale test with 50 concurrent workers (10 Instances each) in an environment with 25,000 existing Instances and 50,000 hosts in a single zone.

Figure 8 – Instance deployment time across successive batches of 10,000 VMs (50 concurrent workers × 200 Instances each) in a 50,000-host zone. Deployment time improved considerably with CloudStack 4.20, though it increased gradually with each additional batch.

  • Deployment performance is roughly inversely proportional to the number of Hosts in a Cluster.

Figure 9 – Average Instance deployment time for different Host counts per Cluster, with 50,000 Hosts evenly distributed across Clusters and 10,000 Instances deployed using 50 concurrent workers.

  • CloudStack sustained 50–100 concurrent deployment workers efficiently on a Management Server with only 2 vCPUs and 16 GB RAM.

Figure 10 – 50–100 workers proved to be the sweet spot when behaviour was tested with different numbers of workers, each deploying 10 Instances.

Implementation & Release Notes

  • 4.20.0

HikariCP connection pooling and Caffeine caching.

  • 4.20.1

Agent communication and concurrency improvements, caching enhancements, and API/Server optimisations.

Roadmap & Next Steps

  • Extend scaling to Domains, Users, and Volumes.
  • Improve allocator and scheduler efficiency.
  • Refactor legacy modules.
  • Optimise background task handling.

 

Conclusion

Starting with CloudStack 4.20, the platform has reached a new milestone in hyperscale readiness by addressing key bottlenecks in database efficiency, agent communication, caching, and concurrency. These improvements establish a solid foundation for operating at scales of 100,000 Hosts and millions of Instances. From 4.21 onward, further optimisations—covering domains, Users, storage volumes, and scheduling—will continue to extend this work, strengthening CloudStack’s ability to deliver consistent performance and reliability in large-scale deployments.
