Marvin: “I think you ought to know I’m feeling very depressed.”
Trillian: “Well, we have something that may take your mind off it.”
Marvin: “It won’t work, I have an exceptionally large mind.”

Trillian was born from our need to create environments which we could run CloudStack’s Marvin test framework against, but the variety of uses for a versatile tool to create cloud environments on demand quickly became clear. We have used nested virtualisation to hand-craft cloud environments for quite a while now; however, we needed automation around it so that environments could be created quickly, easily and consistently. Taking inspiration from the Hitchhiker’s Guide To The Galaxy, we grabbed our towels and started work.

We started with a list of high level use cases:

  • Test new feature software builds (manually and via Marvin)
  • Test community releases (manually and via Marvin)
  • Replicate failure scenarios for support clients
  • Evaluate new features
  • Evaluate complementary technologies

 

On top of this we had a number of other requirements:

  • Environments should be as close to production as possible
  • Support as many hypervisors as possible
  • Support as many CloudStack environment permutations as possible
  • Support ‘multi-tenancy’ such that a number of clouds can be created and/or running at the same time
  • Easy to connect to external integration points such as SolidFire storage, NetScalers, Cloudian S3 installations, etc.
  • ‘Hard code’ as little as possible
  • Make it portable/replicable so that we could share it
  • Minimise the number of tools/technologies used
  • Enable CI/CD integration (particularly with Jenkins)
  • Reasonably quick deployment times

 

So Trillian became a tool to build realistic, fully functioning CloudStack cloud environments from a simple command line statement, while still fulfilling the further requirements which we’d set out.

Now that Trillian is at a V1.0 stage, we’d like to share our work for anyone to use, and/or contribute to.

What we used

For an underlying hypervisor that could reliably support nested virtualisation, KVM or ESXi were the standout choices. Our development team’s go-to hypervisor is KVM, but after a certain amount of kernel hacking, the number of workarounds that we were needing to employ made us look to my personal favourite again: ESXi. Using a few tricks that I have come across in the past, I knew that we could relatively simply create fully functioning nested clouds.

So our next question was orchestration. Well, we have the world’s best cloud orchestration platform at our fingertips (ahem) – CloudStack. But we still needed to drive it with something and there was still a lot of configuration to do. The flexibility that we required ruled out creating templates for every possible hypervisor and mgmt VM that we might need, so we went to another personal favourite – Ansible.

Ansible: The King of Config Management and Automation

Ansible allowed us to keep the number of tools down to pretty much one. We considered additional tools such as Packer, Terraform etc, but we felt we were just adding more and more layers and dependencies, when we could just use Ansible.

Our great friend Rene Moser added CloudStack modules into Ansible 2.0 and 2.1, giving us the ability to create individual projects to put each nested cloud in, and to create the nested cloud VMs directly from Ansible. With the updates and fixes in Ansible 2.1 we are able to configure vSphere environments on-the-fly as well. Simple SSH connections to XenServer and KVM hosts put them at our mercy, ditto the CloudStack management hosts, MySQL hosts, and Marvin hosts which all run on CentOS or Ubuntu.

Once we could create and configure every individual component it became a matter of stitching it all together in a way that remained flexible.

How it works

Our setup is split into two parts.  First is our generate-cloudconfig play.  This takes a set of Ansible extra-vars, does some checks that they seem valid and then generates a host file and a groupvars file that describe the environment that you require. For very complicated environment architectures these files can be manually changed to reflect components or architectures which cannot be described on a simple command line. An example instantiation of the play might be:
ansible-playbook generate-cloudconfig.yml -i localhost --extra-vars "env_name=cs49-vmw55-pga env_version=cs49 mgmt_os=6 hvtype=v vmware_ver=55u3 hv=2 pri=2 env_accounts=all"
This example creates a project called “cs49-vmw55-pga” based on CloudStack 4.9, with a CentOS 6(.8) mgmt server and 2 ESXi 5.5u3 hypervisor hosts. The cluster will have 2 primary storage pools and all accounts will have permission to access the project. Trillian understands that a vSphere environment requires a vCenter server as well, and adds that to the inventory.
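
To make the idea concrete, here is a minimal Python sketch of what such a generation step could look like. It is not Trillian's actual code (Trillian does this in Ansible): the function names, the host naming scheme and the "k"/"x" hypervisor codes are assumptions for illustration, and only hvtype=v comes from the example above.

```python
# Hypothetical mapping of hvtype codes to inventory group names.
# Only "v" (VMware) appears in the article; "k" and "x" are assumptions.
HYPERVISOR_GROUPS = {"v": "esxi", "k": "kvm", "x": "xenserver"}


def parse_extra_vars(text):
    """Split a 'key=value key=value' string into a dict (no quoting support)."""
    return dict(pair.split("=", 1) for pair in text.split())


def build_inventory(extra_vars):
    """Check the variables and return {group name: [host names]}."""
    env = extra_vars["env_name"]
    hvtype = extra_vars["hvtype"]
    if hvtype not in HYPERVISOR_GROUPS:
        raise ValueError(f"unknown hvtype: {hvtype!r}")
    count = int(extra_vars.get("hv", 1))
    if count < 1:
        raise ValueError("hv must be at least 1")

    group = HYPERVISOR_GROUPS[hvtype]
    inventory = {
        "mgmt": [f"{env}-mgmt1"],
        group: [f"{env}-{group}{i}" for i in range(1, count + 1)],
    }
    # A vSphere environment also needs a vCenter server.
    if hvtype == "v":
        inventory["vcenter"] = [f"{env}-vc1"]
    return inventory


if __name__ == "__main__":
    args = ("env_name=cs49-vmw55-pga env_version=cs49 mgmt_os=6 hvtype=v "
            "vmware_ver=55u3 hv=2 pri=2 env_accounts=all")
    for group_name, hosts in build_inventory(parse_extra_vars(args)).items():
        print(f"[{group_name}]")
        print("\n".join(hosts))
```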

A global variable file holds the mappings of CloudStack versions, host hypervisor types/versions and system VM URLs, abstracting many variables away from the user. However, EVERY variable can be overridden in the extra-vars. For instance, baseurl_cloudstack=http://10.2.0.4/shapeblue/cloudstack/testing/ builds the management server from a specific repo, and sec=2 creates 2 secondary storage pools.
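
The override behaviour comes down to defaults being looked up by version and then replaced by anything given on the command line (Ansible gives extra-vars the highest variable precedence). A minimal Python sketch of that merge follows. The default values and the resolve_vars helper are placeholders for illustration, not the contents of Trillian's real variable file.

```python
# Placeholder defaults keyed by environment version (assumed values).
GLOBAL_DEFAULTS = {
    "cs49": {
        "baseurl_cloudstack": "http://repo.example.invalid/cloudstack/4.9/",
        "pri": "1",
        "sec": "1",
    },
}


def resolve_vars(env_version, extra_vars):
    """Start from the version's defaults, then let extra-vars win."""
    resolved = dict(GLOBAL_DEFAULTS[env_version])  # copy, so defaults stay intact
    resolved.update(extra_vars)
    return resolved


if __name__ == "__main__":
    overrides = {
        "baseurl_cloudstack": "http://10.2.0.4/shapeblue/cloudstack/testing/",
        "sec": "2",
    }
    print(resolve_vars("cs49", overrides))
```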

Trillian’s Intelligence

There is one aspect that we haven’t covered here, and that’s the creation of a zone using these components that actually works, particularly when multiple clouds are in existence at the same time.

One way around this is to encapsulate (or nest) the nested clouds so that they can each use the same IP space without tripping over each other. However, this does cause a massive performance hit and complicates deployment somewhat. But the main problem for us with that approach is that we need the environments to appear as close to a production cloud as possible, which includes ‘direct access’ to the public and management networks.

So we used two techniques. The first was to create a shared network on VLAN 4095 and have the parent CloudStack present that to nested hypervisors for guest and public traffic. VLAN 4095 causes ESXi to trunk all VLANs, allowing us to pass guest traffic between nested guest VMs, even if they’re on different physical hosts. The second was to create a (MySQL) database of VLAN and management/public IP address ranges. The IP address ranges for system VMs share a common gateway, but crucially do not overlap. We then request a range of guest VLANs and IP address ranges for public and management networks for a new environment and we are returned unused ranges, ensuring that all cloud environments can co-exist. Once an environment has served its purpose, the play to remove the VMs also marks the used ranges as available again in the database.
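
The claim-and-release pattern for ranges can be sketched in a few lines of Python. This is an illustration only: it uses an in-memory SQLite database rather than MySQL so that it runs anywhere, and the table layout and function names are assumptions, not Trillian's real schema. It covers guest VLAN ranges only, but IP ranges would follow the same pattern.

```python
import sqlite3


def open_pool(vlan_ranges):
    """Create an in-memory pool of (first_vlan, last_vlan) ranges, all free."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE vlan_ranges ("
        "id INTEGER PRIMARY KEY, first_vlan INTEGER, last_vlan INTEGER, "
        "env_name TEXT)"  # env_name is NULL while the range is unused
    )
    db.executemany(
        "INSERT INTO vlan_ranges (first_vlan, last_vlan) VALUES (?, ?)",
        vlan_ranges,
    )
    db.commit()
    return db


def claim_range(db, env_name):
    """Mark the first unused range as belonging to env_name and return it."""
    with db:  # commits on success, rolls back on error
        row = db.execute(
            "SELECT id, first_vlan, last_vlan FROM vlan_ranges "
            "WHERE env_name IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            raise RuntimeError("no free VLAN range left")
        db.execute(
            "UPDATE vlan_ranges SET env_name = ? WHERE id = ?",
            (env_name, row[0]),
        )
    return row[1], row[2]


def release_ranges(db, env_name):
    """Make every range held by env_name available again."""
    with db:
        db.execute(
            "UPDATE vlan_ranges SET env_name = NULL WHERE env_name = ?",
            (env_name,),
        )
```

With several builds running at once against a shared database server, the select and update would need to happen atomically (for example with row locking) so that two environments can never be handed the same range.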

The Heavy Lifting

Once we’ve created the environment configuration files and been assigned a guest VLAN range and public and management IP ranges, we can now build and configure everything.

First we build all of the VMs. These can include mgmt server(s), dedicated MySQL server(s) (default is to run MySQL on the primary mgmt server), KVM hosts, XenServer hosts, ESXi hosts, vCenter host and/or a Marvin host.

Next we install CloudStack and MySQL. Then, depending on the hypervisor, we either install KVM and the CloudStack agent on the KVM hosts, configure the XenServer hosts and create a pool, or configure the ESXi hosts and add them to the vCenter as a cluster.

Next we create the primary and secondary storage pools and seed the relevant system VM template files.

The final step is to create the zone on the mgmt server. This is done using an Ansible template of a Cloudmonkey script, which takes all of the environment variables and produces a ready-to-run script that Ansible then kindly runs. The Ansible template allows loops and conditionals, so the template can deal with any number of hosts or storage pools.
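
To illustrate the looping idea, here is a rough sketch that uses plain Python string formatting in place of an Ansible (Jinja2) template. The render_zone_script function, the env dictionary keys and the ZONE_ID, POD_ID, CLUSTER_ID and HOST_PASSWORD placeholders are all assumptions. The commands follow Cloudmonkey's verb-plus-noun form, but this is nowhere near a complete zone build (pods, clusters, networks and secondary storage are left out).

```python
def render_zone_script(env):
    """Return a Cloudmonkey-style script as text, one command per line."""
    lines = [
        "create zone name={zone} networktype=Advanced "
        "dns1={dns} internaldns1={dns}".format(
            zone=env["zone_name"], dns=env["dns1"]
        ),
    ]
    # One 'add host' command per hypervisor host, however many there are.
    for host in env["hosts"]:
        lines.append(
            "add host zoneid=ZONE_ID podid=POD_ID clusterid=CLUSTER_ID "
            f"hypervisor={env['hypervisor']} url=http://{host} "
            "username=root password=HOST_PASSWORD"
        )
    # One 'create storagepool' command per primary storage URL.
    for index, url in enumerate(env["primary_storage"], start=1):
        lines.append(
            "create storagepool zoneid=ZONE_ID podid=POD_ID "
            f"clusterid=CLUSTER_ID name={env['zone_name']}-pri{index} url={url}"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = {
        "zone_name": "cs49-kvm-demo",   # made-up environment
        "dns1": "10.2.0.1",
        "hypervisor": "KVM",
        "hosts": ["10.2.1.11", "10.2.1.12"],
        "primary_storage": ["nfs://10.2.0.16/export/pri1"],
    }
    print(render_zone_script(example), end="")
```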

The result is a CloudStack environment which has running system VMs and is happily downloading the default templates.

For test and development purposes we have the additional arguments build_marvin=yes and wait_till_setup=yes. The first builds a Marvin host and generates the Marvin cfg file based on the deployed environment, while wait_till_setup polls for the system VMs to be in an ‘Up’ state and only returns when the environment is fully ready.
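
The polling behaviour of wait_till_setup can be pictured with a small Python sketch. The wait_until_up helper and its get_states callback are hypothetical (Trillian does this inside Ansible). The callback is assumed to return a mapping of system VM name to state, however that is fetched.

```python
import time


def wait_until_up(get_states, timeout=600.0, interval=10.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll get_states() until every system VM reports 'Up'.

    Raises TimeoutError if that has not happened within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        states = get_states()
        # An empty result means the system VMs do not exist yet.
        if states and all(state == "Up" for state in states.values()):
            return states
        if clock() >= deadline:
            raise TimeoutError(f"system VMs not Up after {timeout}s: {states}")
        sleep(interval)
```

Passing in sleep and clock as parameters is only there to make the loop easy to exercise without real waiting.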

Tips and Tricks

There were a number of tweaks we made here and there, and we’ll be updating our documentation to reflect them, in order to make the journey as easy as possible for anyone else who would like to use this. As a taster, we:

  • Enabled promiscuous mode and forged transmits on the parent hosts to allow traffic to and from nested guest VMs, and then added VMware Labs’ dvfilter on nested hosts to protect network performance
  • Created a GeneraliseXenServer script to allow us to clone XenServers
  • Created a GeneraliseESXi script to allow us to clone ESXi hosts
  • Enabled vmware.nested.virtualization in the parent CloudStack
  • Found that VMXNet3 vNICs are great, except for nested hosts, where they do weird and wonderful things. E1000 is steady but sure.

Summary

So, Trillian gives us an extremely flexible way to quickly build cloud environments for a multitude of purposes. We’re at version 1.0, and the quality is now at a point where we’re ready to open source it and share it with anyone who can make use of it.

We still have more plans, including Hyper-V support, and we’re working on our documentation and on making it as simple as we can to create the parent CloudStack configuration and templates.

If you’d like to have a look, download or submit a pull request please go here: https://github.com/shapeblue/Trillian

Acknowledgements

Trillian was initially developed by Paul Angus, Dag Sonstebo and Glenn Wagner.

About The Author

Paul Angus is VP Technology & Cloud Architect at ShapeBlue, The Cloud Specialists. He has designed and implemented numerous environments based on Apache CloudStack for customers across 4 continents.

Some say that when not building Clouds, Paul likes to create Ansible playbooks that build clouds. And that he’s actually read A Brief History of Time.