Scaling Apache CloudStack to 100,000 Hosts and Millions of Instances

Abstract

Apache CloudStack, a proven open‑source IaaS platform, has demonstrated strong scalability in production deployments with thousands of hypervisors. However, reaching hyperscale targets—100,000 hosts and millions of instances—revealed architectural and resource bottlenecks. This whitepaper analyses those challenges and describes enhancements such as database optimisations, improved agent communications, caching, and concurrency handling. These improvements, validated through rigorous benchmarking, were introduced in the 4.20 series.

Introduction

Motivation: Cloud providers and large enterprises require cloud platforms that scale seamlessly beyond current upper limits.

Objective: Extend CloudStack’s capabilities to support 100,000 hosts and millions of Instances while maintaining performance and reliability.

Approach: Combine simulator-based and agent-based testing, profiling, and iterative optimisations targeting critical subsystems.

Baseline & Challenges

Existing Scale: Production deployments routinely handle 5,000 hosts. To test hyperscale boundaries, a simulator-based environment using KVM Agents was built.

Identified Bottlenecks:

  • Centralised Database: High query load exposed missing indexes and redundant operations.
  • API & Concurrency: Heavy automation and parallel workflows stressed API endpoints.
  • Agent Communication: TLS handshake bottlenecks and slow restart behaviour.
  • Lack of Fault Tolerance: Failures frequently propagated across subsystems.
  • Storage Connectivity: Sequential worker processing limited throughput.
  • Resource Scheduling: Large Clusters degraded deployment efficiency.

 

Optimisation Strategies

Database Layer

  • Switched to HikariCP

CloudStack 4.20.0 introduced configurable database connection pooling via the db.<DATABASE>.connectionPoolLib property in db.properties. In hyperscale tests, switching from DBCP2 to HikariCP reduced blocked threads and improved memory scalability. A benchmark with 50,000 simulator Hosts and 10,000 Instances demonstrated clear advantages for HikariCP.
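A minimal db.properties sketch of this setting is shown below. The values are illustrative assumptions; the exact accepted pool-library value should be confirmed against the 4.20 documentation.

  # db.properties (illustrative sketch – values are assumptions, not verified defaults)
  # Select the connection pool library per database; DBCP2 remains the default.
  db.cloud.connectionPoolLib=hikaricp
  db.usage.connectionPoolLib=hikaricp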

Figure 1 – With DBCP

Figure 2 – With HikariCP

  • Optimisation for Connection Ping

For JDBC4-compliant MySQL connectors, connection validation now uses the Connection.isValid() API instead of issuing extra SELECT 1 queries, reducing latency and improving throughput.
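A minimal Java sketch contrasting the two validation approaches is shown below; the JDBC URL and credentials are placeholders, not CloudStack's actual configuration.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class ConnectionValidationSketch {
      public static void main(String[] args) throws Exception {
          Connection conn = DriverManager.getConnection(
                  "jdbc:mysql://localhost:3306/cloud", "cloud", "password");

          // JDBC4 validation: handled natively by the driver, no extra query round trip.
          boolean healthy = conn.isValid(2); // timeout in seconds

          // Legacy-style validation: sends an additional query for every check.
          try (Statement stmt = conn.createStatement()) {
              stmt.execute("SELECT 1");
          }

          System.out.println("Connection valid: " + healthy);
          conn.close();
      }
  }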

 

  • Database Retrieval Optimisations

Optimisations included:

  • Avoiding full-table fetches when only counts are needed.
  • Selecting only the required columns instead of entire rows.
  • Reducing repetitive retrieval of identical data.

These changes lowered database load and improved API responsiveness; a brief sketch of the first two patterns follows.
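The sketch below illustrates the general idea in Java/JDBC; the table and column names are hypothetical and do not necessarily match CloudStack's schema or DAO layer.

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLException;

  public class RetrievalSketch {
      // Let the database count rows instead of fetching every row into memory.
      long countRunningVms(Connection conn) throws SQLException {
          try (PreparedStatement ps = conn.prepareStatement(
                  "SELECT COUNT(*) FROM vm_instance WHERE state = ?")) {
              ps.setString(1, "Running");
              try (ResultSet rs = ps.executeQuery()) {
                  rs.next();
                  return rs.getLong(1);
              }
          }
      }

      // Select only the required column instead of the entire row.
      String findVmName(Connection conn, long vmId) throws SQLException {
          try (PreparedStatement ps = conn.prepareStatement(
                  "SELECT name FROM vm_instance WHERE id = ?")) {
              ps.setLong(1, vmId);
              try (ResultSet rs = ps.executeQuery()) {
                  return rs.next() ? rs.getString(1) : null;
              }
          }
      }
  }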

Agent–Server Communication

  • TLS Concurrency Enhancements

The handshake and connection acceptance flow between Agents and the Management Server were optimised to handle significantly more concurrent connections. This improved reconnection speed and throughput, especially when thousands of Agents reconnected after a Management Server restart.
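The general pattern is sketched below, assuming a dedicated handshake pool; this illustrates the technique rather than CloudStack's actual implementation, and the port and pool size are assumptions.

  import javax.net.ssl.SSLServerSocket;
  import javax.net.ssl.SSLServerSocketFactory;
  import javax.net.ssl.SSLSocket;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  public class HandshakeOffloadSketch {
      public static void main(String[] args) throws Exception {
          ExecutorService handshakePool = Executors.newFixedThreadPool(32); // assumed size
          SSLServerSocket server = (SSLServerSocket)
                  SSLServerSocketFactory.getDefault().createServerSocket(8250); // assumed port
          while (true) {
              SSLSocket client = (SSLSocket) server.accept();
              handshakePool.submit(() -> {
                  try {
                      // The expensive TLS handshake runs off the accept thread, so one
                      // slow client cannot stall thousands of reconnecting Agents.
                      client.startHandshake();
                      // Hand the established connection to the agent-handling logic here.
                  } catch (Exception e) {
                      try { client.close(); } catch (Exception ignored) { }
                  }
              });
          }
      }
  }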

  • Mock Agent Plugin for Testing

A mock Agent plugin was developed to simulate large-scale environments. This plugin mimics a KVM-like Agent but executes hypervisor requests as NO-OPs. It enabled controlled, repeatable testing of reconnections, load-balancing, and high-throughput scenarios without requiring thousands of physical Hosts.
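Conceptually, the plugin acknowledges every hypervisor command without doing any work, as in the hypothetical sketch below (class and method names are illustrative, not the plugin's real API).

  public class MockAgentSketch {
      interface Command { String name(); }
      record Answer(boolean success, String details) { }

      // Every hypervisor request is answered as a NO-OP, so thousands of such agents
      // can run on a handful of machines while still exercising the Management
      // Server's connection, scheduling, and messaging paths.
      Answer execute(Command cmd) {
          return new Answer(true, "NO-OP for " + cmd.name());
      }
  }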

Figure 3 – Significant improvements in Agent reconnection time before and after the changes. In a configuration with two Management Servers, 10,000 mock Agents were first connected to the first Management Server and then made to reconnect to the second all at once.

Caching Improvements

  • Caffeine Cache Adoption

CloudStack now uses the Caffeine caching library, providing higher hit rates and lower latency.

  • Configuration Key Caching

Frequently accessed configuration values are now cached, reducing database lookups by ~75%. This improved API response times and reduced contention under heavy load.

Figure 4 – Before changes

Figure 5 – After changes
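A minimal sketch of configuration-value caching with Caffeine is shown below; the cache size, expiry, and loader are illustrative assumptions rather than CloudStack's actual settings.

  import com.github.benmanes.caffeine.cache.Cache;
  import com.github.benmanes.caffeine.cache.Caffeine;
  import java.time.Duration;

  public class ConfigCacheSketch {
      private final Cache<String, String> configCache = Caffeine.newBuilder()
              .maximumSize(10_000)                       // assumed capacity
              .expireAfterWrite(Duration.ofSeconds(30))  // assumed expiry
              .build();

      String getConfigValue(String key) {
          // The database is queried only when the value is absent or expired.
          return configCache.get(key, this::loadFromDatabase);
      }

      private String loadFromDatabase(String key) {
          // Placeholder for the real configuration lookup.
          return "value-for-" + key;
      }
  }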

Concurrent Workers

Background tasks such as storage connections and capacity calculations now use concurrent worker threads. Running tasks in parallel significantly improved throughput in environments with tens of thousands of Hosts.

Figure 6 – Storage connection performance before and after the change with 50,000 Hosts. After the change, the worker count was set to 2.
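A minimal sketch of the pattern is shown below, assuming a configurable worker count; the class and method names are illustrative rather than CloudStack's actual code.

  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  public class ConcurrentWorkerSketch {
      void connectHostsToPool(List<Long> hostIds, int workerCount) throws InterruptedException {
          ExecutorService workers = Executors.newFixedThreadPool(workerCount);
          for (Long hostId : hostIds) {
              // Each Host is connected independently instead of sequentially.
              workers.submit(() -> connectHost(hostId));
          }
          workers.shutdown();
          workers.awaitTermination(1, TimeUnit.HOURS);
      }

      private void connectHost(Long hostId) {
          // Placeholder for the per-host storage connection logic.
      }
  }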

JVM & System Tuning

  • Adopted G1GC: Reduced garbage collection pause times in large-scale environments.
  • Heap and OS Tuning: JVM heap sizes and OS parameters were tuned for stability and scalability under heavy load. An illustrative set of flags is sketched below.
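The flags below are a hedged example of this kind of tuning; the heap size and pause-time target are assumptions that depend on the Management Server's available memory, not recommended values from the benchmark.

  # Illustrative JVM options for the Management Server (values are assumptions)
  JAVA_OPTS="-Xms8g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"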

Results

  • Faster Agent connection and reconnection.
  • Reduced deployment time for Instances in large-scale concurrent scenarios.

Figure 7 – Instance deployment time improvement in a large-scale test with 50 concurrent workers (10 Instances each) in an environment with 25,000 existing Instances and 50,000 hosts in a single zone.

Figure 8 – Instance deployment time across successive batches of 10,000 VMs (50 concurrent workers × 200 Instances each) in a 50,000-host zone. Deployment time improved considerably with CloudStack 4.20, though it increased gradually with each additional batch.

  • Deployment performance is roughly inversely proportional to the number of Hosts in a Cluster.

Figure 9 – Average Instance deployment time for different Host counts per Cluster, with 50,000 Hosts evenly distributed across Clusters and 10,000 Instances deployed using 50 concurrent workers.

  • CloudStack sustained 50–100 concurrent deployment workers efficiently on a Management Server with only 2 vCPUs and 16 GB RAM.

Figure 10 – 50–100 workers proved to be the sweet spot when behaviour was tested with different numbers of workers, each deploying 10 Instances.

Implementation & Release Notes

  • 4.20.0

HikariCP connection pooling and Caffeine caching.

  • 4.20.1

Agent communication and concurrency improvements, caching enhancements, and API/Server optimisations.

Roadmap & Next Steps

  • Extend scaling to Domains, Users, and Volumes.
  • Improve allocator and scheduler efficiency.
  • Refactor legacy modules.
  • Optimise background task handling.

 

Conclusion

Starting with CloudStack 4.20, the platform has reached a new milestone in hyperscale readiness by addressing key bottlenecks in database efficiency, agent communication, caching, and concurrency. These improvements establish a solid foundation for operating at scales of 100,000 Hosts and millions of Instances. From 4.21 onward, further optimisations—covering domains, Users, storage volumes, and scheduling—will continue to extend this work, strengthening CloudStack’s ability to deliver consistent performance and reliability in large-scale deployments.
