Posts

Thursday, May 27 was the first-ever vCSEUG, and the first time the community had met since February 2020 (the last pre-COVID meetup in Berlin). It was time to reconnect! We have missed the chance to interact with other community members, learn what’s new in CloudStack and hear from the companies using it. So, we needed to find a solution to overcome all barriers. Organizing a virtual event was not only a great chance for the EU User Group members to meet but also to invite CloudStack community members and contributors from all around the globe to join in.

About the Attendees and Speakers

The vCSEUG proved to be a huge success. People joined from 23 countries and 4 continents (by my count) – from Germany, UK, Switzerland, India, Bulgaria, Greece, Poland, Serbia, Brazil, Chile, Russia, USA, Canada, Japan, France, Uruguay, Korea … (apologies if I have missed anyone)! We also had a record number of registrations and attendees for a CloudStack User Group Event. Physical distance was not an issue for our speakers, who joined the event from 6 different countries.

As usual, this was a half-day event, but we chose to have shorter talks so that we could accommodate a wider range of topics, proving that when you want to learn new things and meet the CloudStack community, nothing can stop you. During the day, we enjoyed 6 sessions from industry leaders, with conversation continuing in the ‘vPub’ after the last talk.

Without the need for sponsorship for a venue and refreshments, instead, we surprised some of the attendees with great prizes provided by the event sponsors – ShapeBlue, LINBIT, Dimsi, StorPool and PC Extreme – all of whom helped us to make the virtual event happen.

 

vCSEUG Proved to be a Huge Success

You may ask: “What is the secret behind making a virtual event happen?”. We think it is the great community we have and its dedication to doing something valuable and exciting after a long time of being not able to meet. This was evident by the number of attendees and the quality of talks, questions and ongoing collaboration.

If you missed the event or some of the talks, we are happy to share recordings as usual and will make all slides available on SlideShare. Continue reading and discover more about the event sessions and our awesome speakers.

 

vCSEUG Talks and Presentations

 

What’s New in CloudStack 4.15 – Giles Sirett

vCSEUG started with a talk from Giles Sirett, Chairman of the CSEUG and PMC member, Apache CloudStack. Giles welcomed the attendees and shared in-depth insights about the new features and functionalities in CloudStack 4.15. He also provided info on when 4.15.1 and 4.16 are expected, presented the new VP of Apache CloudStack (Gabriel Brascher), talked through the latest integrations to CloudStack, improvements in the UI, new OS support, advanced capabilities of vSphere, OVF support, dynamic roles enhancements and more. Giles talk in full here:

 

Customising the CloudStack UI – Abhishek Kumar

The next talk came from Abhishek Kumar, Software Developer at ShapeBlue, who focused on customizing the new UI. It aimed to teach administrators how to tailor the UI aesthetics to their organization’s preferences. The session targeted advanced users, training them to alter the layout of the UI, add, remove, or restrict resource actions from the UI. You can read more on how to customize the CloudStack UI on our blog and watch a full recording of Abhishesk’s session:

 

From Мetal to Service: 100% automation with Apache CloudStack and Ansible – Rafael del Valle

The next talk was presented by Rafael del Valle, Co-Founder of Celpax. His session was “From metal to service: 100% automation with Apache CloudStack and Ansible”. Celpax.com has recently deployed Apache CloudStack on Hetzner+Premises with full metal to service automation. In this talk, Rafael presented their success story. Furthermore, he shared why they chose open-source technologies and what advantages they got.

 

 

KVM High Availability Regardless of Storage – Gabriel Brascher

One of the most highly-anticipated talks was by the new Apache Cloudstack VP – Gabriel Beims Bräscher – talking about KVM High Availability Regardless of Storage. One of the great advantages of CloudStack is that it is vendor-independent, meaning you can decide the technology stack above and below based on your experience or needs. Having High Availability enabled for KVM hosts can improve greatly the QoS by handling (fence/recover) a problematic Host as well as re-starting its stopped VMs on healthy hosts. However, there is a limitation on CloudStack HA for KVM – it relies mainly on NFS heartbeat script checks. Gabriel’s talk illustrated how CloudStack HA works for KVM hosts and presented a way of improving its implementation in a way that KVM HA works with any storage system pluggable on KVM, not just NFS.

 

CloudStack and Tungsten Fabric SDN Integration – Simon Weller, Radu Todirica

After a short break, we continued with CloudStack and Tungsten Fabric SDN Integration Update, presented by Radu Todirica and Simon Weller from Education Networks of America (ENA). Over the past year, ENA and EWERK have been collaborating on a new ACS plugin for the Tungsten Fabric SDN controller. Simon Weller and Radu Todirica provided an update on progress, feature overview and a live demo.

 

CloudStack Deployments for Edge Use Cases – Rudraksh Kulshreshtha

The honour of the last talk of the day was given to Rudraksh Kulshreshtha from IndiQus, who presented how to architect lean CloudStack deployments for Edge use cases.

 

After the last talk, and the usual round of questions and answers, we headed over to the vPub where conversation, collaboration and debate continued. We would like to thank all attendees, sponsors and people engaged in the event who made it happen.

See you soon on another virtual or maybe live event …

 

Introduction

In the previous two parts of this article series, we have covered the complete Ceph installation process and implemented Ceph as an additional Primary Storage in CloudStack. In this final part, I will show you some examples of working with RBD images, and will cover some Ceph specifics, both in general and related to the CloudStack.

RBD image manipulations

In case you need to do some low-level client support, you can even try to mount that image as the local disk on any KVM (or Ceph) node. For this purpose, we use a tool, conveniently named, “rbd”, which is used to operate RBD images in general (i.e. to create new images, snapshots, clones, delete images, etc.).

From any KVM node, let’s attempt to use rbd kernel client to mount an image:

[root@kvm1 ~]# rbd map cloudstack/ad9a6725-4a65-4c8b-b60a-843ed88618be
rbd: sysfs write failed
RBD image feature set mismatch. Try disabling features unsupported by the kernel with "rbd feature disable".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address

As you can, see the above map command will fail. This happens as we are running on default kernel 3.x on the latest CentOS 7.6 and some of the new Ceph image features are not supported by the kernel client. Actually, even upgrading kernel to 5.0 will not bring Ceph kernel client to a usable state for our Mimic (or even Luminous) cluster – so one could wonder why kernel client exists at all….

So what we should do is to use rbd-nbd tool to map an image. Rbd-nbd is a client for RADOS block device (RBD) images similar to rbd kernel module, but unlike the rbd kernel module (which communicates with Ceph cluster directly), rbd-nbd uses NBD (generic block driver in kernel) to convert read/write requests to proper commands that are sent through network using librbd (user space client).

So, as stated, we are using librbd which is always on par with the cluster capabilities/features. But here the fun begins – the NBD kernel driver is not available by default with CentOS 7 (Red Hat decided to not include it with their kernel), so you have a couple of options: either you can rebuild the specific kernel version from the official kernel sources and extract the NBD kernel module or you can upgrade a kernel to the one provided by a well-known Elrepo repository, which I find simpler and easier to manage in the long run. For those of you possibly not familiar with Elrepo repository, this is a community repository, providing many different kinds of additional packages for Centos/RHEL, including fresh kernel versions – Elrepo kernel packages are built from the official sources while the kernel configuration is based upon the default RHEL configuration with added functionality enabled as appropriate.

A simple way to move to Elrepo kernel would be as following:

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
yum install https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
yum --enablerepo=elrepo-kernel install kernel-lt

If the above URLs become invalid, please find the new one at https://elrepo.org/tiki/tiki-index.php.

Now that we’ve got the new kernel installed, let’s check the order of kernels offered for boot:

[root@kvm1 ~]# awk -F\' /^menuentry/{print\$2} /etc/grub2.cfg
CentOS Linux (4.4.178-1.el7.elrepo.x86_64) 7 (Core)
CentOS Linux (3.10.0-957.10.1.el7.x86_64) 7 (Core)
CentOS Linux (3.10.0-693.11.1.el7.x86_64) 7 (Core)
CentOS Linux (3.10.0-327.el7.x86_64) 7 (Core)
CentOS Linux (0-rescue-76b17df4966743ce9a20fe9a7098e2b6) 7 (Core)

Above we see our new kernel (version 4.4) on position 0, so let’s set it as the default one, and reboot:

grub2-set-default 0
reboot

Finally, with the NBD driver in place (as part of new kernel), let’s install rbd-nbd and mount our image:

[root@kvm1 ~]#  yum install rbd-nbd -y 
[root@kvm1 ~]#  rbd-nbd map cloudstack/ad9a6725-4a65-4c8b-b60a-843ed88618be
/dev/nbd0

The above map command completed successfully (make sure that the image/volume is not mounted elsewhere to avoid file system corruption) and we can now operate /dev/nbd0 as any other locally attached drive, i.e.:

[root@kvm1 Ceph]# fdisk -l /dev/nbd0
 
Disk /dev/nbd0: 5368 MB, 5368709120 bytes, 10485760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4194304 bytes / 4194304 bytes
Disk label type: dos
Disk identifier: 0x264c895d
 
     Device Boot      Start         End      Blocks   Id  System
/dev/nbd0p1            8192    10485759     5238784   83  Linux

You can also show any other mapped images (if any):

[root@kvm1 Ceph]# rbd-nbd list-mapped 
id    pool       image                                snap device
11371 cloudstack ad9a6725-4a65-4c8b-b60a-843ed88618be -    /dev/nbd0

When done with this, unmap the RBD image with:

[root@kvm1 Ceph]# rbd-nbd unmap /dev/nbd0 

Alternatively, just to make the above exercise more complete, we can also attach the RBD image to our host with qemu-nbd (as you can guess this also requires NBD kernel module, a.k.a. newer kernel):

qemu-nbd --connect=/dev/nbd0 rbd:cloudstack/ad9a6725-4a65-4c8b-b60a-843ed88618be
qemu-nbd --disconnect /dev/nbd0

Again, the above tool talks to librbd, which relies on ceph.conf and admin key being present in /etc/ceph/ folder.

RBD image manipulations, for real this time

Away from the NBD magic and back to “rbd” tool – let’s briefly show some usage examples to manipulate RBD images.

At this point, I would expect you to be able to create a Compute Offering of your own, targeting Ceph as Primary Storage pool (similarly to how we created a Disk offering with “RBD” tag) and create a VM from a template. Assuming you have done so, let’s examine this VM’s ROOT image on Ceph.

Let’s list all volumes in our Ceph cluster:

[root@ceph1 ~]# rbd -p cloudstack ls
d9a1586d-a30b-4c52-99cc-c5ee6433fe18
fb3ee723-5e4e-48b3-ad7d-936162656cb4

In my example above (an empty cluster), I have only 2 images present, so let’s examine them:

[root@ceph1 ~]# rbd info cloudstack/d9a1586d-a30b-4c52-99cc-c5ee6433fe18
rbd image 'd9a1586d-a30b-4c52-99cc-c5ee6433fe18':
        size 50 MiB in 13 objects
        order 22 (4 MiB objects)
        id: 1b1d26b8b4567
        block_name_prefix: rbd_data.1b1d26b8b4567
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        op_features:
        flags:
        create_timestamp: Mon Apr  8 21:40:21 2019

[root@ceph1 ~]# rbd info cloudstack/fb3ee723-5e4e-48b3-ad7d-936162656cb4
rbd image 'fb3ee723-5e4e-48b3-ad7d-936162656cb4':
        size 50 MiB in 13 objects
        order 22 (4 MiB objects)
        id: 1b250327b23c6
        block_name_prefix: rbd_data.1b250327b23c6
        format: 2
        features: layering
        op_features:
        flags:
        create_timestamp: Mon Apr  8 21:50:10 2019
        parent: cloudstack/d9a1586d-a30b-4c52-99cc-c5ee6433fe18@cloudstack-base-snap

What we see above is the first image of 50 MB in size (here I’m using a very small template from our community friends at http://www.openvm.eu/). For the second image, we see that it has it’s “parent” set to the first image. What is happening here is that CloudStack will copy over the template from Secondary Storage (creating image d9a1586d-a30b-4c52-99cc-c5ee6433fe18), it will then create a snapshots from this image (“cloudstack-base-snap”, as shown above) and protect it (required from Ceph side), and then create an image clone ( image fb3ee723-5e4e-48b3-ad7d-936162656cb4) with a parent image being the previously created/protected snapshot. This is effectively a linked clone setup, in Ceph’s world.

Let’s quickly emulate above behavior manually – we will just create an empty file instead of copying a real template from Secondary Storage pool:

rbd create -p cloudstack mytemplate --size 100GB
rbd snap create cloudstack/mytemplate@cloudstack-base-snap
rbd snap protect cloudstack/mytemplate@cloudstack-base-snap
rbd clone cloudstack/mytemplate@cloudstack-base-snap cloudstack/myVMvolume

Finally, let’s check our “myVMvolume” image:

[root@ceph1 ~]# rbd info cloudstack/myVMvolume
rbd image 'myVMvolume':
        size 100 GiB in 25600 objects
        order 22 (4 MiB objects)
        id: fcba6b8b4567
        block_name_prefix: rbd_data.fcba6b8b4567
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        op_features:
        flags:
        create_timestamp: Mon Apr  8 22:08:00 2019
        parent: cloudstack/mytemplate@cloudstack-base-snap
        overlap: 100 GiB

The reason we have above protected a snapshots of the main template, is that it makes it impossible to delete this template file and thus cause major damage (effectively destroy all VMs from this template). So in order to revert the exercise from above, we would need to first remove a clone (VM image), unprotect and delete the snapshot, and finally delete the “template” image:

[root@ceph1 ~]# rbd rm cloudstack/myVMvolume
Removing image: 100% complete...done.
[root@ceph1 ~]# rbd snap unprotect cloudstack/mytemplate@cloudstack-base-snap
[root@ceph1 ~]# rbd snap rm cloudstack/mytemplate@cloudstack-base-snap
Removing snap: 100% complete...done.
[root@ceph1 ~]# rbd rm cloudstack/mytemplate
Removing image: 100% complete...done.

As you can see from above, the “rbd” tool is the tool to use to manipulate RBD images – for many other examples, please see http://docs.ceph.com/docs/master/rbd/rados-rbd-cmds/

In the next sections, I will try to cover a bit of what to expect, dos and don’ts of using Ceph cluster – again, this is not meant to be a comprehensive guide, since CloudStack storage feature sets related to Ceph and also Ceph itself is constantly changing and improving.

Cloudstack with Ceph

CloudStack has supported Ceph for many years now and though in the early days, many of the CloudStack’s storage features were not on pair with NFS, with the wider adoption of Ceph as the Primary Storage solution for CloudStack, most of the missing features have been implemented. That being said, there are still some minor specifics with Ceph:

  • In contrast to NFS, which provides a file system, Ceph provides raw block devices for VMs and thus it’s not possible to make a full VM snapshot the way it can be done with NFS – i.e. there is no filesystem to which to write the content of the RAM memory or other metadata needed.
  • Continuing on above, since no file system, there is (currently) no way to write a heartbeat file the way it’s implemented on NFS (but some work is currently being planned around this)
  • Speaking of a simple volume snapshot, it’s currently not possible to really restore a volume from a snapshot (which became possible in 4.11 with NFS or SolidFire Managed Storage). But this is rather a simple thing to implement.
  • By using Ceph with KVM and CloudStack, there are two external libraries used – librbd (a user-space client, used by Qemu to speak to Ceph cluster) and rados-java (a Java wrapper for librados). Historically, there have been some issues/bugs with rados-java (though they have been resolved long time ago), so you should probably keep an eye on these two and make sure they are up to date.

Learning curve

Ceph can be considered a rather complex storage system to comprehend and it definitively has a steep learning curve. Even when using Ceph on its own (i.e. not with CloudStack), make sure you know the storage system well before relying on it in production and also make sure to be able to troubleshoot problematic situations when they arise. Anyone prepared to put in a bit of effort and planning can deploy a nice Ceph cluster, but it takes skills and a deeper understanding of how things work under the hood when it comes to troubleshooting unusual situations or how (for example) replacing failed disks (or nodes) can influence performance of the clients (in our case, CloudStack VMs). Based on my previous experience managing a production Ceph cluster, I would definitively strongly suggest taking the above recommendations very seriously. On the other hand, experimenting with such an advanced storage solution can be really interesting and rewarding.

Performance considerations

Ceph is a great distributed storage solution that performs very well under sequential IO workload, can scale indefinitely and has a very interesting architecture. However, as with any distributed storage solution, when you write data to a cluster, it actually takes time to write that data to first node, have that write operation replicated to other 2 nodes (replica size 3) and send back ACK to the client that the write is successful. In case you are using NVME devices, like some users in community, you can expect very low latencies in the ballpark of around 0.5ms, but in case you opt out for a HDD based solution (with journals on SSDs) as many users do when starting their Ceph journey, you can expect 10-30ms of latency, depending on many factors (cluster size, networking latency, SSD/journal latency, and so on). In other words, don’t expect miracles with commodity hardware – if something “works on commodity hardware”, that doesn’t mean it also performs well. Actually, until 2-3 years ago, Ceph was (unofficially) considered unsuitable for any more serious random IO workload, which is backed up by referenced guides published between major HW vendors and RedHat, where they clearly stated that Ceph is not the “best choice” when it comes to random IO. In contrast to sequential IO workload benchmarks (where Ceph actually shines), these reference guides were completely missing any benchmark data with random IO workload/pattern. That being said, with the introduction of BlueStore as the new storage backend in recent Ceph releases and with some more architectural changes, Ceph has become much more suitable for pure SSD (and NVME) clusters and performance has improved drastically.

I hope this Ceph article series has been useful and interesting and helped you get up and running. All in all Ceph is an interesting development in the storage space and can provide a true cloud storage solution if implemented correctly.

About the author

Andrija Panic is a Cloud Architect at ShapeBlue, the Cloud Specialists, and is a committer of Apache CloudStack. Andrija spends most of his time designing and implementing IaaS solutions based on Apache CloudStack.

In the previous article we covered some basics around Ceph and deployed a working Ceph cluster. In this article, we are going to finish the Ceph configuration needed for CloudStack and add it as a new Primary Storage pool. We are also going to deploy Ceph volumes via CloudStack and examine them. Finally, in part 3 (to be published soon), I will show you some examples of working with RBD images and will cover some Ceph specifics, both in general and related to the CloudStack.

Before proceeding with the actual work, let me first mention that CloudStack supports Ceph with KVM only, so most of the work we do below is KVM related. Let’s define the high-level steps to be done:

  • Create a dedicated RBD pool for CloudStack in which all RBD images (volumes) will be created
  • Create a dedicated authentication key for the previously created pool
  • Update / install required Ceph binaries on KVM nodes
  • Add Ceph as Primary Storage in CloudStack
  • Implement custom storage tag for Ceph Primary Storage
  • Create new Compute / Disk offerings with same storage tag in order to target Ceph

From any Ceph node…

Ceph groups RBD (RADOS block device) images in pools and manages authentication on a per pool level. Each image is collection of many RADOS objects, with each object having a default size of 4MB (configurable per image). At this moment we have no pools created. But before creating a pool, let’s go through some basics around the different kind of pools in Ceph.

There are 2 kind of pools, based on the way the objects are stored across cluster:

  • Replicated – makes sure that there are always total of N replicas/copies of an object
  • Erasure Coding – simplest way to think of this is a network RAID 5/6

Replicated pools are used for better performance at the expense of space consumption, and you can think of it as a network-based RAID 1, where we have n number of replicas of an object. On the other hand, erasure coding pools are usually used when using Ceph for S3 Object Storage purposes and for more space efficient storage where bigger latency and lower performance is acceptable, since it is similar to RAID 5 or RAID 6 (requires some computation power). Here, for example, we may have 4 chunks of actual data and 2 parity chunks (EC 4+2), with just 50% of space overhead, while (depending on the setup), we can still survive losing a Ceph node or even two.

So, let’s create a dedicated pool for CloudStack, set its replica size and finally initialize it:

ceph osd pool create cloudstack 64 replicated
ceph osd pool set cloudstack size 3
rbd pool init cloudstack

The commands above will create a replicated pool named “cloudstack” with total of 64 placement groups (more info on placement groups here) with a replica size of 3, which is recommended for a production cluster. Optionally, you can set replica size of 2 during testing, for somewhat increased performance and less space consumed on the cluster.

Next, let’s generate a dedicated authentication key for our CloudStack pool:

ceph auth get-or-create client.cloudstack mon 'profile rbd' osd 'profile rbd pool=cloudstack'

The command above will output a key to STDOUT only – please save the given key, since we will use it when adding Ceph to CloudStack later:

[client.cloudstack]
key = AQAFSZpc0t+BIBAAO95rOl+jgRwuOopojEtr/g==

Now that the pool for CloudStack is ready, we need to prepare KVM nodes with proper Ceph binaries as well as the write-back caching configuration.

From the Ceph admin node…

Starting from Centos 7.2  (and Ubuntu 14.04), libvirt / QEMU comes by default with support for RBD, so there’s no need to compile the binaries yourself. That being said, if we check KVM nodes with “rpm -qa | grep librbd1″ it will return an existing versions of ‘librbd1” package (version 10.2.5 in my case)  already installed, but most certainly it will not be the current version that corresponds to the cluster version we just installed (13.2.5 in this case). For the record, librbd is a user space Ceph client, to which the qemu / libvirt talks effectively.

Furthermore, if we run command “ceph features” from any Ceph node, it will return (in our fresh Mimic cluster) “luminous” as the minimum compatible release version for the client – that means that our Ceph client (librbd) needs to be of a minimum of “luminous” version (which translates to 12.2.0), but our current librbd version is 10.2.5 – so let’s upgrade it to same Mimic versions as the version of our cluster:

ceph-deploy install --cli kvm1  kvm2

The command above will add Mimic repo to my two KVM nodes and install only the cli binaries (“ceph-common” package). This will also trigger the upgrade of existing “librbd1” package to the correct version. In addition, please make sure that name resolution of the KVM nodes works from the Ceph admin node.

Optionally, if you don’t want to install Ceph cli tools on KVM nodes, you can just upgrade the “librbd1” package while having previously created a proper Ceph Mimic repository on each KVM node (i.e. clone repo file from any Ceph cluster node).

Some of you might want to be able to manage Ceph cluster from KVM nodes as well (beside being able to manage it from Ceph nodes) and to be able to interact with RBD images with via “rbd” or “qemu-img“ tools – in this case we need this “rbd” tool installed on KVM nodes (part of “ceph-common” package, already installed in previous step), then we need ceph.conf locally on KVM nodes in order for the “rbd” tool to know how to connect to cluster, which MONs to target, etc. and finally we need the admin authentication key – this is the file “ceph.client.admin.keyring” which was created on our Ceph admin node when we created our cluster initially (in folder /root/CEPH-CLUSTER, as mentioned in Part 1 of this article series).

Additionally, if we want to use qemu-img tool to examine RBD images, we can either have qemu-img installed on the Ceph cluster nodes or we have to provide the above mentioned ceph.conf and admin keys in their default location (/etc/ceph/) on the KVM nodes, where librbd (client) will pick them up automatically, so we don’t need to specify MON IP/URL and admin key on the command line.

If you don’t want to be able to manage your Ceph cluster from KVM nodes, simply don’t copy over the “ceph.client.admin.keyring” file to KVM nodes. The ceph.conf file is still a must due to RBD caching as explained later. I have decided to make my KVM nodes happy by providing them with ceph.conf and admin keys, as below:

ceph-deploy admin kvm1 kvm2

The command above will effectively just copy ceph.conf and ceph.client.admin.keyring files to /etc/ceph/ folder on KVM nodes. Actually, you can still operate RBD images and manage your cluster from KVM nodes even if you don’t have ceph.conf and admin key present locally – you can always pass required parameters on the command line to “rbd” or “qemu-img” tools, as shown later.

RBD caching

After we have pushed the ceph.conf file to KVM nodes, librbd will read it for any configuration directives under the “[client]” section of that file (beside the other sections), but that section is missing at this moment!

Before we proceed into configuring the RBD caching, let me do here a copy/paste from the original docs that is important to understand regarding RBD caching:

” The user space implementation of the Ceph block device (i.e., librbd) cannot take advantage of the Linux page cache, so it includes its own in-memory caching, called “RBD caching.” RBD caching behaves just like well-behaved hard disk caching. When the OS sends a barrier or a flush request, all dirty data is written to the OSDs. This means that using write-back caching is just as safe as using a well-behaved physical hard disk with a VM that properly sends flushes (i.e. Linux kernel >= 2.6.32). The cache uses a Least Recently Used (LRU) algorithm, and in write-back mode it can coalesce contiguous requests for better throughput. “

After digesting the above info, we can proceed into a brief configuration of caching. We can either fix it manually on each KVM node by adding the missing section in ceph.conf file, or we can do it in a more proper way by changing ceph.conf on the Ceph admin node and then pushing new file version to all KVM (and optionally Ceph cluster) nodes:

cat << EOM >> /root/CEPH-CLUSTER/ceph.conf
[client]
  rbd cache = true
  rbd cache writethrough until flush = true
EOM
 
ceph-deploy --overwrite-conf admin kvm1 kvm2

Please note the above “writethrough until flush = true”. This is a safety mechanism which will force writethrough cache mode until it receives the very first flush request from the VM OS (which means that the OS is sending proper flush requests to the underlying storage, i.e. kernel >= 2.6.32) and then cache mode will change to the write-back, which actually brings performance benefits.

If case you want to play more with RBD caching, please see here – where you can find some important default values which we didn’t explicitly configure i.e. default rbd cache size is 32 MB (this is per volume) – so in case of 50 VMs with 4 volumes each, that translates to 50 x 4 x 32MB =  6.4GB of additional RAM consumed on a KVM host – keep that in mind !

Finally, let’s add the Ceph to CloudStack as an additional Primary Storage – we can do it via GUI or optionally via CloudMonkey (API) as following:

 

Or via CloudMonkey:

create storagepool scope=zone zoneid=3c764ee1-6590-417d-b873-f073d0c550be hypervisor=KVM name=MyCephCluster provider=Defaultprimary url=rbd://cloudstack:AQAFSZpc0t-BIBAAO95rOl+jgRwuOopojEtr_g==@10.2.2.219/cloudstack tags=RBD

Most of the parameters are self-explanatory but let’s explain a few of them:

  • RADOS Monitor: This is the IP address (or DNS name) of the Ceph Monitor (MON) instance – in my case I have defined a very first MON instance (IP address of the Ceph1 node from my cluster) – but in production environment you will want to have an internal Round Robin DNS setup on some internal DNS server (i.e. single zone on Bind) – such that KVM nodes will resolve the ULR (i.e. mon.myceph.cluster) in a round robin fashion to multiple MON instances – this is the way to achieve high availability of Ceph MONs, though some manual DNS zone changes are needed in case of prolonged MON maintenance
  • RADOS Pool: This is the pool “cloudstack” which we created in the beginning of the article
  • RADOS User and RADOS Secret: This are the values from the authentication key which we generated in the beginning of the article, shown below again for your convenience

[client.cloudstack]
key = AQAFSZpc0t+BIBAAO95rOl+jgRwuOopojEtr/g==

The above command, used to add Ceph to CloudStack, will effectively do a few things:

  • On each KVM node, it will create a new storage pool in libvirt
  • The storage pool definition files (xml and the secret) will be written to /etc/libvirt/secrets/ folder as shown below
  • Every time CloudStack Agent is restarted, it will recreate the Ceph storage pool (even if you manually remove the files below)

[root@kvm1]# cat /etc/libvirt/secrets/ef9cfd17-abe1-343d-97a0-cee6c71a6dad.xml
<secret ephemeral='no' private='no'>
  <uuid>ef9cfd17-abe1-343d-97a0-cee6c71a6dad</uuid>
  <usage type='ceph'>
    <name>cloudstack@ceph1.local:6789/cloudstack</name>
  </usage>
</secret>

[root@kvm1]# cat /etc/libvirt/secrets/ef9cfd17-abe1-343d-97a0-cee6c71a6dad.base64
AQAFSZpc0t+BIBAAO95rOl+jgRwuOopojEtr/g==

If we check the libvirt pool created above, we can see that it’s not persistent and it doesn’t start automatically – i.e. when you restart libvirt alone, it will not recreate / start the Ceph storage pool in libvirt– the CloudStack agent is the one doing this for us:

virsh # pool-info ef9cfd17-abe1-343d-97a0-cee6c71a6dad
Name:           ef9cfd17-abe1-343d-97a0-cee6c71a6dad
UUID:           ef9cfd17-abe1-343d-97a0-cee6c71a6dad
State:          running
Persistent:     no
Autostart:      no
Capacity:       299.99 GiB
Allocation:     68.19 MiB
Available:      286.02 GiB

Note that in the example above, I was actually using DNS name for the Ceph MON (ceph1.local) instead of the IP – Ceph MON’s DNS name is resolved to IP both when you add Ceph to CloudStack and every time you start a VM or attach new volume, etc. – so DNS resolution needs to be fast and stable here.

Now that we added Ceph to CloudStack, let’s create a Data disk offering with tag “RBD” – this will make sure that any new volume from this offering is created on storage pool with tag “RBD” – which is Ceph in our case . Here, we are using storage tags to avoid messing up with your existing CloudStack installation – but it’s not required otherwise:

(localcloud) SBCM5> > create diskoffering name=5GB-Ceph displaytext=5GB-Ceph storagetype=shared provisioningtype=thin customized=false disksize=5 tags=RBD
{
  "diskoffering": {
    "created": "2019-03-26T19:27:32+0000",
    "disksize": 5,
    "displayoffering": true,
    "displaytext": "5GB-Ceph",
    "id": "2c74becc-c39d-4aa8-beec-195b351bdaf0",
    "iscustomized": false,
    "name": "5GB-Ceph",
    "provisioningtype": "thin",
    "storagetype": "shared",
    "tags": "RBD"
  }
}

Note the offering ID from above (2c74becc-c39d-4aa8-beec-195b351bdaf0) – and let’s create a disk from it:

(localcloud) SBCM5> > create volume diskofferingid=2c74becc-c39d-4aa8-beec-195b351bdaf0 name=MyFirstCephDisk zoneid=3c764ee1-6590-417d-b873-f073d0c550be
{
  "volume": {
    "account": "admin",
    "created": "2019-03-26T19:52:05+0000",
    "destroyed": false,
    "diskofferingdisplaytext": "5GB-Ceph",
    "diskofferingid": "2c74becc-c39d-4aa8-beec-195b351bdaf0",
    "diskofferingname": "5GB-Ceph",
    "displayvolume": true,
    "domain": "ROOT",
    "domainid": "401ce404-44c1-11e9-96c5-1e009001076e",
    "hypervisor": "None",
    "id": "47b1cfe5-6bab-4506-87b6-d85b77d9b69c",
    "isextractable": true,
    "jobid": "49a682ab-42f9-4974-8e42-452a13c97553",
    "jobstatus": 0,
    "name": "MyFirstCephDisk",
    "provisioningtype": "thin",
    "quiescevm": false,
    "size": 5368709120,
    "state": "Allocated",
    "storagetype": "shared",
    "tags": [],
    "type": "DATADISK",
    "zoneid": "3c764ee1-6590-417d-b873-f073d0c550be",
    "zonename": "ref-trl-1019-k-M7-apanic"
  }
}

Finally, since volume creation is a lazy provisioning process (i.e. volume is created in DB only, not really on storage pool), let’s attach the disk to a running VM (using volume ID “47b1cfe5-6bab-4506-87b6-d85b77d9b69c” from previous command output), which will trigger the actual disk creation on our Ceph cluster (output shortened for brevity):

(localcloud) SBCM5> > attach volume id=47b1cfe5-6bab-4506-87b6-d85b77d9b69c virtualmachineid=19a67e20-c747-43bb-b149-c2b2294002f9
{
  "volume": {
    …
    "jobstatus": 0,
    "name": "MyFirstCephDisk",
    "path": "47b1cfe5-6bab-4506-87b6-d85b77d9b69c",
    …  }
}

Note the “path” output field (which is usually the same as the ID of the volume, except in some special cases) – and let’s check our Ceph cluster if we can find this volume and check it’s properties.

From any KVM node…

[root@kvm1 ~]# rbd ls -p cloudstack
47b1cfe5-6bab-4506-87b6-d85b77d9b69c
 
[root@kvm1 ~]# rbd info cloudstack/47b1cfe5-6bab-4506-87b6-d85b77d9b69c
rbd image '47b1cfe5-6bab-4506-87b6-d85b77d9b69c':
        size 5 GiB in 1280 objects
        order 22 (4 MiB objects)
        id: d43b4c04a8af
        block_name_prefix: rbd_data.d43b4c04a8af
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        op_features:
        flags:
        create_timestamp: Tue Mar 28 19:46:32 2019

We can also examine the Ceph RBD image with qemu-img tool:

[root@kvm1 ~]# qemu-img info rbd:cloudstack/47b1cfe5-6bab-4506-87b6-d85b77d9b69c
image: rbd:cloudstack/47b1cfe5-6bab-4506-87b6-d85b77d9b69c
file format: raw
virtual size: 5.0G (5368709120 bytes)
disk size: unavailable

As you can see in qemu-img command above, we did not specify any username and authentication keys, because we have our ceph.conf and the admin key files present in /etc/ceph/ folder. If you decided to opt-out of having these 2 files present on KMV nodes, you will have to use a cumbersome command as below:

qemu-img info rbd:cloudstack/47b1cfe5-6bab-4506-87b6-d85b77d9b69c:mon_host=10.2.2.219:auth_supported=Cephx:id=cloudstack:key=AQAFSZpc0t+BIBAAO95rOl+jgRwuOopojEtr/g==

In the above command we are specifying the MON IP address, username and key for authentication.

Now that you got the basics of consuming Ceph from CloudStack, feel free to also create Compute Offerings and System Offerings for Virtual Routers, Secondary Storage VM, Console Proxy VM and experiment with volume migration from i.e. NFS to Ceph. Be sure to have storage tags under control.

I hope that this article series has been interesting so far. In part 3 (which will be the final part), I will show you some examples of working with RBD images and will cover some Ceph specifics, both in general and related to CloudStack.

About the author

Andrija Panic is a Cloud Architect at ShapeBlue, the Cloud Specialists, and is a committer of Apache CloudStack. Andrija spends most of his time designing and implementing IaaS solutions based on Apache CloudStack.

As well as NFS and various block storage solutions for Primary Storage, CloudStack has supported Ceph with KVM for a number of years now. Thanks to some great Ceph users in the community lots of previously missing CloudStack storage features have been implemented for Ceph (and lots of bugs squashed), making it the perfect choice for CloudStack if you are looking for easy scaling of storage and decent performance.

In this and my next article, I am going to cover all steps needed to actually install a Ceph cluster from scratch, and subsequently add it to CloudStack. In this article I will cover installation and basic configuration of a standalone Ceph cluster, whilst in part 2 I will go into creating a pool for a CloudStack installation, adding Ceph to CloudStack as an additional Primary Storage and creating Compute and Disk offerings for Ceph. In part 3, I will also try to explain some of the differences between Ceph and NFS, both from architectural / integration point of view, as well as when it makes sense (or doesn’t) to use it as the Primary Storage solution.

It is worth mentioning that the Ceph cluster we build in this first article can be consumed by any RBD client (not just CloudStack). Although in part 2 we move onto integrating your new Ceph cluster into CloudStack, this article is about creating a standalone Ceph cluster – so you are free to experiment with Ceph.

Firstly, I would like to share some high-level recommendations from very experienced community members, who have been using Ceph with CloudStack for a number of years:

  • Make sure that your production cluster is at least 10 nodes so as to minimize any impact on performance during data rebalancing (in case of disk or whole node failure). Having to rebalance 10% of data has a much smaller impact (and duration) than having to rebalance 33% of data; another reason is improved performance as data is distributed across more drives and thus read / write performance is better
  • Use 10GB networking or faster – a separate network for client and replication traffic is needed for optimal performance
  • Don’t rely on cache tiering, unless you have a very specific IO pattern / use case. Moving data in and out of cache tier can quickly create a bottleneck and do more harm than good
  • If running an older version of Ceph cluster (eg. FileStore based OSD), you will probably place your journals on SSDs. If so, make sure that you properly benchmark SSD for the synchronous IO write performance (Ceph writes to journal devices with O_DIRECT and D_SYNC flags). Don’t try to put too many journals on single SSD; consumer grade SSDs are unacceptable, since their synchronous write performance is usually extremely bad and they have proven to be exceptionally unreliable when used in a Ceph cluster as journal device

Before we continue, let me state that this first article is NOT meant to be a comprehensive guide on Ceph history, theory, installation or optimization, but merely a simple step-by-step guide for a basic installation, just to get us going. Still, in order to be able to better follow the article, it’s good to define some basics around Ceph architecture.

Ceph has a couple of different components and daemons, which serves different purposes, so let’s mention some of these (relevant for our setup):

  • OSD (Object Storage Daemon) – usually maps to a single drive (HDD, SDD, NVME) and it’s the one containing user data. As can be concluded from it’s name, there is a Linux process for each OSD running in a node. A node hosting only OSDs can be considered as a Storage or OSD node in Ceph’s terminology.
  • MON (Monitor daemon) – holds the cluster map(s), which provides to Ceph Clients and Ceph OSD Daemons with the knowledge of the cluster topology. To clarify this further, in the heart of Ceph is the CRUSH algorithm, which makes sure that OSDs and clients can calculate the location of specific chunk of data in the cluster (and connect to specific OSDs for read/write of data), without a need to read it’s position from somewhere (as opposite to a regular file systems which have pointers to the actual data location on a partition).

A couple of other things are worth mentioning:

  • For cluster redundancy, it’s required to have multiple Ceph MONs installed, always aiming for an odd number to avoid a chance of split-brain scenario. For smaller clusters, these could be placed on VMs or even collocated with other Ceph roles (i.e. OSD nodes), though busier clusters will need a dedicated, powerful servers/VMs. In contrast to OSDs, there can be only one MON instance per server/VM.
  • For improved performance, you might want to place MON’s database (LevelDB) on dedicated SSDs (versus the defaults of being placed on OS partition).
  • There are two ways that OSDs can manage the data they store. Starting with the Luminous 12.2.z release, the new default (and recommended) backend is BlueStore. Prior to Luminous, the default (and only option) was FileStore. With FileStore, data is first written to a Journal (which can be collocated with the OSD on same device or it can be a completely separate partition on a faster, dedicated device) and then later committed to OSD. With BlueStore, there is no true Journal per se, but a RocksDB key/value database (for managing OSD’s internal metadata). FileStore OSD will use XFS on top of it’s partition, while BlueStore write data directly to raw device, without a need for a file system. With it’s new architecture, BlueStore brings big speed improvement over FileStore.
  • When building and operating a cluster, you will probably want to have a dedicated server/VM used as the deployment or admin node. This node will host your deployment tools (be it a basic ceph-deploy tool or a full blown ansible playbook), as well as cluster definition and configuration files, which can be changed on central place (this node) and then pushed to cluster nodes as required.

Armed with above knowledge (and against all recommendations given previously) we are going to deploy a very minimalistic installation of Ceph cluster on top of 3 servers (VMs), with 1 volume per node being dedicated for an OSD daemon, and Ceph MONs collocated with the Operating System on the system volume. The reason for choosing such a minimalistic setup is the ability to quickly build a test cluster on top of 3 VMs (which most people will do when building their very first Ceph cluster) and to keep configuration as short as possible. Remember, we just want to be able to consume Ceph from CloudStack, and currently don’t care about performance or uptime / redundancy (beside some basic things, which we will cover explicitly).

Our setup will be as following:

  • We will already have a working CloudStack 4.11.2 installation (i.e. we expect you to have a working CloudStack installation)
  • We will add Ceph storage as an additional Primary Storage to CloudStack and create offerings for it
  • CloudStack Management Server will be used as Ceph admin (deployment) node
  • Management Server and KVM nodes details:
    • CloudStack Management Server: IP 10.2.2.118
    • KVM host1: IP 10.2.3.135, hostname “kvm1”
    • KVM host2: IP 10.2.2.208, hostname “kvm2”
  • Ceph nodes details (dedicated nodes):
    • 2 CPU, 4GB RAM, OS volume 20GB, DATA volume 100GB
    • Single NIC per node, attached to the CloudStack Management Network – i.e. there is no dedicated network for Primary Storage traffic between our KVM hosts and the Ceph nodes
    • Node1: IP 10.2.2.119, hostname “ceph1”
    • Node2: IP 10.2.2.116, hostname “ceph2”
    • Node3: IP 10.2.3.159, hostname “ceph3”
    • Single OSD (100GB) running on each node
    • MON instance running on each node
    • Ceph Mimic (13.latest) release
    • All nodes will be running latest CentOS 7 release, with default QEMU and Libvirt versions on KVM nodes

As stated above Ceph admin (deployment) node will be on CloudStack Management Server, but as you can guess, you can use a dedicated VM/Server for this purpose as well.

Before proceeding with the actual work, let’s define the high-level steps required to deploy a working Ceph cluster

  • Building the Ceph cluster:
    • Setting time synchronization, host name resolution and password-less login
    • Setting up firewall and SELinux
    • Creating a cluster definition file and auth keys on the deployment node
    • Installation of binaries on cluster nodes
    • Provisioning of MON daemons
    • Copying over the ceph.conf and admin keys to be able to manage the cluster
    • Provisioning of Ceph manager daemons (Ceph Dashboard)
    • Provisioning of OSD daemons
    • Basic configuration

We will cover configuration of KVM nodes in second article.

Let’s start!

On all nodes…

It is critical that the time is properly synchronized across all nodes. If you are running on hypervisor, your VMs might already be synced with the host, otherwise do it the old-fashioned way:

ntpdate -s time.nist.gov
yum install ntp
systemctl enable ntpd
systemctl start ntpd

Make sure each node can resolve the name of each other node –  if not using DNS, make sure to populate /etc/hosts file properly across all 4 nodes (including admin node):

cat << EOM >> /etc/hosts
10.2.2.219 ceph1
10.2.2.116 ceph2
10.2.3.159 ceph3
EOM

On CEPH admin node…

We start by installing ceph-deploy, a tool which we will use to deploy our whole cluster later:

release=mimic
cat << EOM > /etc/yum.repos.d/ceph.repo
[ceph-noarch]
name=Ceph noarch packages
baseurl=https://download.ceph.com/rpm-$release/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
EOM
 
yum install ceph-deploy -y

Let’s enable password-less login for root account – generate SSH keys and seed public key into /root/.ssh/authorized_keys file on all Ceph nodes (in production environment, you might want to use a user with limited privileges with sudo escalation):

ssh-keygen -f $HOME/.ssh/id_rsa -t rsa -N ''
ssh-copy-id root@ceph1
ssh-copy-id root@ceph2
ssh-copy-id root@ceph3

On all CEPH nodes…

Before beginning, ensure that SELINUX is set to permissive mode and verify that firewall is not blocking required connections between Ceph components:

firewall-cmd --zone=public --add-service=ceph-mon --permanent
firewall-cmd --zone=public --add-service=ceph --permanent
firewall-cmd --reload
setenforce 0

Make sure that you make SELINUX changes permanent, by editing /etc/selinux.config and setting ‘SELINUX=permissive’

As for the firewall, in case you are using different distribution or don’t consume firewalld, please refer to the networking configuration reference at http://docs.ceph.com/docs/mimic/rados/configuration/network-config-ref/

On CEPH admin node…

Let’s create cluster definition locally on admin node:

mkdir CEPH-CLUSTER; cd CEPH-CLUSTER/
ceph-deploy new ceph1 ceph2 ceph3

This will trigger a ssh connection to each of above referenced Ceph nodes (to check for machine platform and IP addresses) and will then write a local cluster definition and the MON auth key in the current folder.  Let’s check the files generated:

# ls -la
-rw-r--r-- ceph.conf
-rw-r--r-- ceph-deploy-ceph.log
-rw------- ceph.mon.keyring

On Centos7, if you get the “ImportError: No module named pkg_resources” error message while running ceph-deploy tool, you might need to install missing packages:

yum install python-setuptools

In case that you have multiple network interfaces on Ceph nodes, you will be required to explicitly define public network (which accepts client’s connections) – in this case edit previously created ceph.conf on the local admin node to include public network setting:

echo "public network = 10.2.0.0/16" >> ceph.conf

If you only have one NIC in each Ceph node, the above line is not required.

Still on admin node, let’s start the installation of Ceph binaries across cluster nodes (no services started yet):

 ceph-deploy install ceph1 ceph2 ceph3 

Command above will also output the version of Ceph binaries installed on each node – make sure that you did not get a wrong Ceph version installed due to some other repos present (we are installing Mimic 13.2.5, which is latest as of the time of writing).

Let’s create (initial) MONs on all 3 Ceph nodes:

ceph-deploy mon create-initial

In order to be able to actually manage our Ceph cluster, let’s copy over the admin key and the ceph.conf files to all Ceph nodes:

ceph-deploy admin ceph1 ceph2 ceph3

On any CEPH node…

After previous step, you should be able to issue “ceph -s” from any Ceph node, and this will return the cluster health. If you are lucky enough, your cluster will be in HEALTH_OK state, but it might happen that your MON daemons will complain on time mismatch between the nodes, as following:

[root@ceph1 ~]# ceph -w
  cluster:
    id:     7f2d23c2-1f2e-4c03-821c-cab3d76f84fc
    health: HEALTH_WARN
            clock skew detected on mon.ceph1, mon.ceph3 

In this case, we should stop NTP daemon, force time update (a few times), and start NTP daemon again – and after doing this across all nodes, it would be required to restart Ceph monitors on each node, one by one (give it a few seconds between restart on different nodes) – below we are restarting all Ceph daemons – which effectively means just MONs since we deployed only MONs so far:

systemctl stop ntpd
ntpdate -s time.nist.gov; ntpdate -s time.nist.gov; ntpdate -s time.nist.gov
systemctl start ntpd
systemctl restart ceph.target

After time has been properly synchronized (with less then 0.05 seconds of time difference between the nodes), you should be able to see a cluster in HEALTH_OK state, as below:

[root@ceph1 ~]# ceph -s
  cluster:
    id:     7f2d23c2-1f2e-4c03-821c-cab3d76f84fc
    health: HEALTH_OK

On CEPH admin node…

Now that we are up and running with all Ceph monitors, let’s deploy Ceph manager daemon (Ceph dashboard, that comes with newer releases) on all nodes since they operate in active/standby configuration (we will configure it later):

ceph-deploy mgr create ceph1 ceph2 ceph3

Finally, let’s deploy some OSDs so our cluster can actually hold some data eventually:

ceph-deploy osd create --data /dev/sdb ceph1
ceph-deploy osd create --data /dev/sdb ceph2
ceph-deploy osd create --data /dev/sdb ceph3

Note in commands above, we reference /dev/sdb as the 100GB volume that is used for OSD.

As mentioned previously, newer versions of Ceph (as in our case) will use by default BlueStore as the storage backend, with (by default) collocating block data and RocksDB key/value database (for managing its internal metadata) on the same device (/dev/sdb in our case). In more complex setups, one can choose to separate RockDB DB on faster devices, while block data will remain on slower devices – somewhat similar with the older FileStore setups, where block data would be located on HDDs/SSDs devices, while Journals would be usually placed on SSD/NVME partitions.

On any CEPH node…

After previous step is done, we should get the output similar to below – confirming that we have a 300GB of space available:

[root@ceph1 ~]# ceph -s
  cluster:
    id:     7f2d23c2-1f2e-4c03-821c-cab3d76f84fc
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph1(active)
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0  objects, 0 B
    usage:   3.0 GiB used, 297 GiB / 300 GiB avail
    pgs:  

Finally, let’s enable the Dashboard manager and set the username/password for authentication (which will be encrypted and stored in monitor’s DB) to be able to access it.
In our lab, we will disable SSL connections and keep it simple – but obviously in production environment, you would want to force SSL connections and also install proper SSL certificate:

ceph config set mgr mgr/dashboard/ssl false
ceph mgr module enable dashboard
ceph dashboard set-login-credentials admin password

Let’s login to the Dashboard manager on the active node (ceph1 in our case, as can be seen in the output from “ceph -s” command above):

And there you go – you now have a working Ceph cluster, which concludes part 1 of this Ceph article series. In part 2 (published soon), we will continue our work by creating a dedicated RBD pool and authentication keys for our CloudStack installation, add Ceph to CloudStack, finally consuming it with dedicated Compute / Disk offerings.

It’s worth mentioning that Ceph itself does provided additional services – i.e. it supports S3 object storage (requires installation / configuration of Ceph Object Gateway) as well as POSIX-compliant file system CephFS (requires installation/configuration of Metadata Server), but for CloudStack, we only need Rados Block Device (RBD) services from Ceph.

About the author

Andrija Panic is a Cloud Architect at ShapeBlue, the Cloud Specialists, and is a committer of Apache CloudStack. Andrija spends most of his time designing and implementing IaaS solutions based on Apache CloudStack.

Introduction

This blog describes a new feature to be introduced in the CloudStack 4.12 release (already in the current master branch of the CloudStack repository). This feature will provide support for the Data Plane Development Kit (DPDK) in conjunction with Open vSwitch (OVS) for guest VMs and is targeted at the KVM hypervisor.

The Data Plane Development Kit (https://www.dpdk.org/) is a set of libraries and NIC drivers for fast package processing in userspace. Using DPDK along with OVS brings benefits to networking performance on VMs and networking appliances. In this blog, we will introduce how DPDK can be used on on guest VMs once the feature is released.

Please note – DPDK support in CloudStack requires that the KVM hypervisor is running on DPDK compatible hardware.

Enable DPDK support

This feature extends the Open vSwitch feature in CloudStack with DPDK integration. As a prerequisite, Open vSwitch needs to be installed on KVM hosts and enabled in CloudStack. In addition, administrators need to install DPDK libraries on KVM hosts before configuring the CloudStack agents, and I will go into the configuration in detail.

KVM Agent Configuration

An administrator can follow this guide to enable DPDK on a KVM host:

Prerequisites

  • Install OVS on the target KVM host
  • Configure CloudStack agent by editing the /etc/cloudstack/agent/agent.properties file:
    • # network.bridge.type=openvswitch
      # libvirt.vif.driver=com.cloud.hypervisor.kvm.resource.OvsVifDriver
      
  • Install DPDK. Installation guide can be found on this link: http://docs.openvswitch.org/en/latest/intro/install/dpdk/.

Configuration

Edit /etc/cloudstack/agent/agent.properties file, in which <OVS_PATH> is the path in which your OVS ports are created, typically /var/run/openvswitch/:

  • # openvswitch.dpdk.enable=true
    # openvswitch.dpdk.ovs.path=<OVS_PATH>
    

Restart CloudStack agent so that changes take effect:

# systemctl restart cloudstack-agent

DPDK inside guest VMs

Now that CloudStack agents have been configured, users are able to deploy their guest VMs using DPDK. In order to achieve this, they will need to pass extra configurations to enable DPDK:

  • Enable “HugePages” on the VM
  • NUMA node configuration

As of 4.12, passing extra configurations to VM deployments will be allowed. In the case of KVM, the extra configurations are added to the VM XML domain. The CloudStack API methods deployVirtualMachine and updateVirtualMachinewill support the new optional parameter extraconfigand will work in the following way:

 
# deploy virtualmachine ... extraconfig=<URL_UTF-8_ENCODED_CONFIGS>

CloudStack will expect a URL UTF-8 encoded string which can support multiple extra configurations. For example, if a user wants to enable DPDK, they will need to pass two extra configurations as we have mentioned above. An example of both configurations are the following:

 
dpdk-hugepages:
<memoryBacking> 
   <hugePages/> 
</memoryBacking> 

dpdk-numa: 
<cpu mode='host-passthrough'>
   <numa>
      <cell id='0' cpus='0' memory='9437184' unit='KiB' memAccess='shared'/>
   </numa> 
</cpu>

…which becomes this URL UTF-8 encoded string, and is the one that CloudStack will expect on VM deployments:

 
dpdk-hugepages%3A%20%3CmemoryBacking%3E%20%3ChugePages%2F%3E%20%3C%2FmemoryBacking%3E%20dpdk-numa%3A%20%3Ccpu%20mode%3D%22host-passthrough%22%3E%20%3Cnuma%3E%20%3Ccell%20id%3D%220%22%20cpus%3D%220%22%20memory%3D%229437184%22%20unit%3D%22KiB%22%20memAccess%3D%22shared%22%2F%3E%20%3C%2Fnuma%3E%20%3C%2Fcpu%3E

KVM networking verification

Administrators can verify how OVS ports are created with DPDK support on DPDK enabled hosts, in which users have deployed DPDK enabled guest VMs. These port names start with “csdpdk”:

 

# ovs-vsctl show
....
Port "csdpdk-1"
   tag: 30
   Interface "csdpdk-1"
      type: dpdkvhostuser
Port "csdpdk-4"
   tag: 30
   Interface "csdpdk-4"
      type: dpdkvhostuser

About the author

Nicolas Vazquez is a Senior Software Engineer at ShapeBlue, the Cloud Specialists, and is a committer in the Apache CloudStack project. Nicolas spends his time designing and implementing features in Apache CloudStack.

Introduction

We published the original blog post on KVM networking in 2016 – but in the meantime we have moved on a generation in CentOS and Ubuntu operating systems, and some of the original information is therefore out of date. In this revisit of the original blog post we cover new configuration options for CentOS 7.x as well as Ubuntu 18.04, both of which are now supported hypervisor operating systems in CloudStack 4.11. Ubuntu 18.04 has replaced the legacy networking model with the new Netplan implementation, and this does mean different configuration both for linux bridge setups as well as OpenvSwitch.

KVM hypervisor networking for CloudStack can sometimes be a challenge, considering KVM doesn’t quite have the same mature guest networking model found in the likes of VMware vSphere and Citrix XenServer. In this blog post we’re looking at the options for networking KVM hosts using bridges and VLANs, and dive a bit deeper into the configuration for these options. Installation of the hypervisor and CloudStack agent is pretty well covered in the CloudStack installation guide, so we’ll not spend too much time on this.

Network bridges

On a linux KVM host guest networking is accomplished using network bridges. These are similar to vSwitches on a VMware ESXi host or networks on a XenServer host (in fact networking on a XenServer host is also accomplished using bridges).

A KVM network bridge is a Layer-2 software device which allows traffic to be forwarded between ports internally on the bridge and the physical network uplinks. The traffic flow is controlled by MAC address tables maintained by the bridge itself, which determine which hosts are connected to which bridge port. The bridges allow for traffic segregation using traditional Layer-2 VLANs as well as SDN Layer-3 overlay networks.

KVMnetworking41

Linux bridges vs OpenVswitch

The bridging on a KVM host can be accomplished using traditional linux bridge networking or by adopting the OpenVswitch back end. Traditional linux bridges have been implemented in the linux kernel since version 2.2, and have been maintained through the 2.x and 3.x kernels. Linux bridges provide all the basic Layer-2 networking required for a KVM hypervisor back end, but it lacks some automation options and is configured on a per host basis.

OpenVswitch was developed to address this, and provides additional automation in addition to new networking capabilities like Software Defined Networking (SDN). OpenVswitch allows for centralised control and distribution across physical hypervisor hosts, similar to distributed vSwitches in VMware vSphere. Distributed switch control does require additional controller infrastructure like OpenDaylight, Nicira, VMware NSX, etc. – which we won’t cover in this article as it’s not a requirement for CloudStack.

It is also worth noting Citrix started using the OpenVswitch backend in XenServer 6.0.

Network configuration overview

For this example we will configure the following networking model, assuming a linux host with four network interfaces which are bonded for resilience. We also assume all switch ports are trunk ports:

  • Network interfaces eth0 + eth1 are bonded as bond0.
  • Network interfaces eth2 + eth3 are bonded as bond1.
  • Bond0 provides the physical uplink for the bridge “cloudbr0”. This bridge carries the untagged host network interface / IP address, and will also be used for the VLAN tagged guest networks.
  • Bond1 provides the physical uplink for the bridge “cloudbr1”. This bridge handles the VLAN tagged public traffic.

The CloudStack zone networks will then be configured as follows:

  • Management and guest traffic is configured to use KVM traffic label “cloudbr0”.
  • Public traffic is configured to use KVM traffic label “cloudbr1”.

In addition to the above it’s important to remember CloudStack itself requires internal connectivity from the hypervisor host to system VMs (Virtual Routers, SSVM and CPVM) over the link local 169.254.0.0/16 subnet. This is done over a host-only bridge “cloud0”, which is created by CloudStack when the host is added to a CloudStack zone.

 

KVMnetworking42

Linux bridge configuration – CentOS

In the following CentOS example we have changed the NIC naming convention back to the legacy “eth0” format rather than the new “eno16777728” format. This is a personal preference – and is generally done to make automation of configuration settings easier. The configuration suggested throughout this blog post can also be implemented using the new NIC naming format.

Across all CentOS versions the “NetworkManager” service is also generally disabled, since this has been found to complicate KVM network configuration and cause unwanted behaviour:

 
# systemctl stop NetworkManager
# systemctl disable NetworkManager

To enable bonding and bridging CentOS 7.x requires the modules installed / loaded:

 
# modprobe --first-time bonding
# yum -y install bridge-utils

If IPv6 isn’t required we also add the following lines to /etc/sysctl.conf:

net.ipv6.conf.all.disable_ipv6 = 1 
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

In CentOS the linux bridge configuration is done with configuration files in /etc/sysconfig/network-scripts/. Each of the four individual NIC interfaces are configured as follows (eth0 / eth1 / eth2 / eth3 are all configured the same way). Note there is no IP configuration against the NICs themselves – these purely point to the respective bonds:

# vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
NAME=eth0
TYPE=Ethernet
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
HWADDR=00:0C:12:xx:xx:xx
NM_CONTROLLED=no

The bond configurations are specified in the equivalent ifcfg-bond scripts and specify bonding options as well as the upstream bridge name. In this case we’re just setting a basic active-passive bond (mode=1) with up/down delays of zero and status monitoring every 100ms (miimon=100). Note there are a multitude of bonding options – please refer to the CentOS / RedHat official documentation to tune these to your specific use case.

# vi /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
NAME=bond0
TYPE=Bond
BRIDGE=cloudbr0
ONBOOT=yes
NM_CONTROLLED=no
BONDING_OPTS="mode=active-backup miimon=100 updelay=0 downdelay=0"

The same goes for bond1:

# vi /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
NAME=bond1
TYPE=Bond
BRIDGE=cloudbr1
ONBOOT=yes
NM_CONTROLLED=no
BONDING_OPTS="mode=active-backup miimon=100 updelay=0 downdelay=0"

Cloudbr0 is configured in the ifcfg-cloudbr0 script. In addition to the bridge configuration we also specify the host IP address, which is tied directly to the bridge since it is on an untagged VLAN:

# vi /etc/sysconfig/network-scripts/ifcfg-cloudbr0
DEVICE=cloudbr0
ONBOOT=yes
TYPE=Bridge
IPADDR=192.168.100.20
NETMASK=255.255.255.0
GATEWAY=192.168.100.1
NM_CONTROLLED=no
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=no
DELAY=0

Cloudbr1 does not have an IP address configured hence the configuration is simpler:

# vi /etc/sysconfig/network-scripts/ifcfg-cloudbr1
DEVICE=cloudbr1
ONBOOT=yes
TYPE=Bridge
BOOTPROTO=none
NM_CONTROLLED=no
DELAY=0
DEFROUTE=no
IPV4_FAILURE_FATAL=no
IPV6INIT=no

Optional tagged interface for storage traffic

If a dedicated VLAN tagged IP interface is required for e.g. storage traffic this can be accomplished by created a VLAN on top of the bond and tying this to a dedicated bridge. In this case we create a new bridge on bond0 using VLAN 100:

# vi /etc/sysconfig/network-scripts/ifcfg-bond.100
DEVICE=bond0.100
VLAN=yes
BOOTPROTO=none
ONBOOT=yes
TYPE=Unknown
BRIDGE=cloudbr100

The bridge can now be configured with the desired IP address for storage connectivity:

# vi /etc/sysconfig/network-scripts/ifcfg-cloudbr100
DEVICE=cloudbr100
ONBOOT=yes
TYPE=Bridge
VLAN=yes
IPADDR=10.0.100.20
NETMASK=255.255.255.0
NM_CONTROLLED=no
DELAY=0

Internal bridge cloud0

When using linux bridge networking there is no requirement to configure the internal “cloud0” bridge, this is all handled by CloudStack.

Network startup

Note – once all network startup scripts are in place and the network service is restarted you may lose connectivity to the host if there are any configuration errors in the files, hence make sure you have console access to rectify any issues.

To make the configuration live restart the network service:

# systemctl restart network

To check the bridges use the brctl command:

# brctl show
bridge name bridge id STP enabled interfaces
cloudbr0 8000.000c29b55932 no bond0
cloudbr1 8000.000c29b45956 no bond1

The bonds can be checked with:

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0c:xx:xx:xx:xx
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0c:xx:xx:xx:xx
Slave queue ID: 0

Linux bridge configuration – Ubuntu

With the 18.04 “Bionic Beaver” release Ubuntu have retired the legacy way of configuring networking through /etc/network/interfaces in favour of Netplan – https://netplan.io/reference. This changes how networking is configured – although the principles around bridge configuration are the same as in previous Ubuntu versions.

First of all ensure correct hostname and FQDN are set in /etc/hostname and /etc/hosts respectively.

To stop network bridge traffic from traversing IPtables / ARPtables also add the following lines to /etc/sysctl.conf, this prevents bridge traffic from traversing IPtables / ARPtables on the host.

# vi /etc/sysctl.conf
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

Ubuntu 18.04 installs the “bridge-utils” and bridge/bonding kernel options by default, and the corresponding modules are also loaded by default, hence there are no requirements to add anything to /etc/modules.

In Ubuntu 18.04 all interface, bond and bridge configuration are configured using cloud-init and the Netplan configuration in /etc/netplan/XX-cloud-init.yaml. Same as for CentOS we are configuring basic active-passive bonds (mode=1) with status monitoring every 100ms (miimon=100), and configuring bridges on top of these. As before the host IP address is tied to cloudbr0:

# vi /etc/netplan/50-cloud-init.yaml
network:
    ethernets:
        eth0:
            dhcp4: no
        eth1:
            dhcp4: no
        eth2:
            dhcp4: no
        eth3:
            dhcp4: no
    bonds:
        bond0:
            dhcp4: no
            interfaces:
                - eth0
                - eth1
            parameters:
                mode: active-backup
                primary: eth0
        bond1:
            dhcp4: no
            interfaces:
                - eth2
                - eth3
            parameters:
                mode: active-backup
                primary: eth2
    bridges:
        cloudbr0:
            addresses:
                - 192.168.100.20/24
            gateway4: 192.168.100.1
            nameservers:
                search: [mycloud.local]
                addresses: [192.168.100.5,192.168.100.6]
            interfaces:
                - bond0
        cloudbr1:
            dhcp4: no
            interfaces:
                - bond1
    version: 2

Optional tagged interface for storage traffic

To add an options VLAN tagged interface for storage traffic add a VLAN and a new bridge to the above configuration:

# vi /etc/netplan/50-cloud-init.yaml
    vlans:
        bond100:
            id: 100
            link: bond0
            dhcp4: no
    bridges:
        cloudbr100:
            addresses:
               - 10.0.100.20/24
            interfaces:
               - bond100

Internal bridge cloud0

When using linux bridge networking the internal “cloud0” bridge is again handled by CloudStack, i.e. there’s no need for specific configuration to be specified for this.

Network startup

Note – once all network startup scripts are in place and the network service is restarted you may lose connectivity to the host if there are any configuration errors in the files, hence make sure you have console access to rectify any issues.

To make the configuration reload Netplan with

# netplan apply

To check the bridges use the brctl command:

# brctl show
bridge name	bridge id		STP enabled	interfaces
cloud0		8000.000000000000	no
cloudbr0	8000.52664b74c6a7	no		bond0
cloudbr1	8000.2e13dfd92f96	no		bond1
cloudbr100	8000.02684d6541db	no		bond100

To check the VLANs and bonds:

# cat /proc/net/vlan/config
VLAN Dev name | VLAN ID
Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
bond100 | 100 | bond0
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 10
Permanent HW addr: 00:0c:xx:xx:xx:xx
Slave queue ID: 0

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 10
Permanent HW addr: 00:0c:xx:xx:xx:xx
Slave queue ID: 0

 

OpenVswitch bridge configuration – CentOS

The OpenVswitch version in the standard CentOS repositories is relatively old (version 2.0). To install a newer version either locate and install this from a third party CentOS/Fedora/RedHat repository, alternatively download and compile the packages from the OVS website http://www.openvswitch.org/download/ (notes on how to compile the packages can be found in http://docs.openvswitch.org/en/latest/intro/install/fedora/).

Once packages are available install and enable OVS with

# yum localinstall openvswitch-<version>.rpm
# systemctl start openvswitch
# systemctl enable openvswitch

In addition to this the bridge module should be blacklisted. Experience has shown that even blacklisting this module does not prevent it from being loaded. To force this set the module install to /bin/false. Please note the CloudStack agent install depends on the bridge module being in place, hence this step should be carried out after agent install.

echo "install bridge /bin/false" > /etc/modprobe.d/bridge-blacklist.conf

As with linux bridging above the following examples assumes IPv6 has been disabled and legacy ethX network interface names are used. In addition the hostname has been set in /etc/sysconfig/network and /etc/hosts.

Add the initial OVS bridges using the ovs-vsctl toolset:

# ovs-vsctl add-br cloudbr0
# ovs-vsctl add-br cloudbr1
# ovs-vsctl add-bond cloudbr0 bond0 eth0 eth1
# ovs-vsctl add-bond cloudbr1 bond1 eth2 eth3

This will configure the bridges in the OVS database, but the settings will not be persistent. To make the settings persistent we need to configure the network configuration scripts in /etc/sysconfig/network-scripts/, similar to when using linux bridges.

Each individual network interface has a generic configuration – note there is no reference to bonds at this stage. The following ifcfg-eth script applies to all interfaces:

# vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
TYPE=Ethernet
BOOTPROTO=none
NAME=eth0
ONBOOT=yes
NM_CONTROLLED=no
HOTPLUG=no
HWADDR=00:0C:xx:xx:xx:xx

The bonds reference the interfaces as well as the upstream bridge. In addition the bond configuration specifies the OVS specific settings for the bond (active-backup, no LACP, 100ms status monitoring):

# vi /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
ONBOOT=yes
DEVICETYPE=ovs
TYPE=OVSBond
OVS_BRIDGE=cloudbr0
BOOTPROTO=none
BOND_IFACES="eth0 eth1"
OVS_OPTIONS="bond_mode=active-backup lacp=off other_config:bond-detect-mode=miimon other_config:bond-miimon-interval=100"
HOTPLUG=no
# vi /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
ONBOOT=yes
DEVICETYPE=ovs
TYPE=OVSBond
OVS_BRIDGE=cloudbr1
BOOTPROTO=none
BOND_IFACES="eth2 eth3"
OVS_OPTIONS="bond_mode=active-backup lacp=off other_config:bond-detect-mode=miimon other_config:bond-miimon-interval=100"
HOTPLUG=no

The bridges are now configured as follows. The host IP address is specified on the untagged cloudbr0 bridge:

# vi /etc/sysconfig/network-scripts/ifcfg-cloudbr0
DEVICE=cloudbr0
ONBOOT=yes
DEVICETYPE=ovs
TYPE=OVSBridge
BOOTPROTO=static
IPADDR=192.168.100.20
NETMASK=255.255.255.0
GATEWAY=192.168.100.1
HOTPLUG=no

Cloudbr1 is configured without an IP address:

# vi /etc/sysconfig/network-scripts/ifcfg-cloudbr1
DEVICE=cloudbr1
ONBOOT=yes
DEVICETYPE=ovs
TYPE=OVSBridge
BOOTPROTO=none
HOTPLUG=no

Internal bridge cloud0

Under CentOS7.x and CloudStack 4.11 the cloud0 bridge is automatically configured, hence no additional configuration steps required.

Optional tagged interface for storage traffic

If a dedicated VLAN tagged IP interface is required for e.g. storage traffic this is accomplished by creating a VLAN tagged fake bridge on top of one of the cloud bridges. In this case we add it to cloudbr0 with VLAN 100:

# ovs-vsctl add-br cloudbr100 cloudbr0 100
# vi /etc/sysconfig/network-scripts/ifcfg-cloudbr100
DEVICE=cloudbr100
ONBOOT=yes
DEVICETYPE=ovs
TYPE=OVSBridge
BOOTPROTO=static
IPADDR=10.0.100.20
NETMASK=255.255.255.0
OVS_OPTIONS="cloudbr0 100"
HOTPLUG=no

Additional OVS network settings

To finish off the OVS network configuration specify the hostname, gateway and IPv6 settings:

vim /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=kvmhost1.mylab.local
GATEWAY=192.168.100.1
NETWORKING_IPV6=no
IPV6INIT=no
IPV6_AUTOCONF=no

VLAN problems when using OVS

Kernel versions older than 3.3 had some issues with VLAN traffic propagating between KVM hosts. This has not been observed in CentOS 7.5 (kernel version 3.10) – however if this issue is encountered look up the OVS VLAN splinter workaround.

Network startup

Note – as mentioned for linux bridge networking – once all network startup scripts are in place and the network service is restarted you may lose connectivity to the host if there are any configuration errors in the files, hence make sure you have console access to rectify any issues.

To make the configuration live restart the network service:

# systemctl restart network

To check the bridges use the ovs-vsctl command. The following shows the optional cloudbr100 on VLAN 100:

# ovs-vsctl show
49cba0db-a529-48e3-9f23-4999e27a7f72
    Bridge "cloudbr0";
        Port "cloudbr0";
            Interface "cloudbr0"
                type: internal
        Port "cloudbr100"
            tag: 100
            Interface "cloudbr100"
                type: internal
        Port "bond0"
            Interface "veth0";
            Interface "eth0"
    Bridge "cloudbr1"
        Port "bond1"
            Interface "eth1"
            Interface "veth1"
        Port "cloudbr1"
            Interface "cloudbr1"
                type: internal
    Bridge "cloud0"
        Port "cloud0"
            Interface "cloud0"
                type: internal
    ovs_version: "2.9.2"

The bond status can be checked with the ovs-appctl command:

ovs-appctl bond/show bond0
---- bond0 ----
bond_mode: active-backup
bond may use recirculation: no, Recirc-ID : -1
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
lacp_status: off
active slave mac: 00:0c:xx:xx:xx:xx(eth0)

slave eth0: enabled
active slave
may_enable: true

slave eth1: enabled
may_enable: true

To ensure that only OVS bridges are used also check that linux bridge control returns no bridges:

# brctl show
bridge name	bridge id		STP enabled	interfaces

As a final note – the CloudStack agent also requires the following two lines added to /etc/cloudstack/agent/agent.properties after install:

network.bridge.type=openvswitch
libvirt.vif.driver=com.cloud.hypervisor.kvm.resource.OvsVifDriver

OpenVswitch bridge configuration – Ubuntu

As discussed earlier in this blog post Ubuntu 18.04 introduced Netplan as a replacement to the legacy “/etc/network/interfaces” network configuration. Unfortunately Netplan does not support OVS, hence the first challenge is to revert Ubuntu to the legacy configuration method.

To disable Netplan first of all add “netcfg/do_not_use_netplan=true” to the GRUB_CMDLINE_LINUX option in /etc/default/grub. The following example also shows the use of legacy interface names as well as IPv6 being disabled:

GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0 ipv6.disable=1 netcfg/do_not_use_netplan=true"

Then rebuild GRUB and reboot the server:

grub-mkconfig -o /boot/grub/grub.cfg

To set the hostname first of all edit “/etc/cloud/cloud.cfg” and set this to preserve the system hostname:

preserve_hostname: true

Thereafter set the hostname with hostnamectl:

hostnamectl set-hostname --static --transient --pretty <hostname>

Now remove Netplan, install OVS from the Ubuntu repositories as well the “ifupdown” package to get standard network functionality back:

apt-get purge nplan netplan.io
apt-get install openvswitch-switch
apt-get install ifupdown

As for CentOS we need to blacklist the bridge module to prevent standard bridges being created. Please note the CloudStack agent install depends on the bridge module being in place, hence this step should be carried out after agent install.

echo "install bridge /bin/false" > /etc/modprobe.d/bridge-blacklist.conf

To stop network bridge traffic from traversing IPtables / ARPtables also add the following lines to /etc/sysctl.conf:

# vi /etc/sysctl.conf
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

Same as for CentOS we first of all add the OVS bridges and bonds from command line using the ovs-vsctl command line tools. In this case we also add the additional tagged fake bridge cloudbr100 on VLAN 100:

# ovs-vsctl add-br cloudbr0
# ovs-vsctl add-br cloudbr1
# ovs-vsctl add-bond cloudbr0 bond0 eth0 eth1 bond_mode=active-backup other_config:bond-detect-mode=miimon other_config:bond-miimon-interval=100
# ovs-vsctl add-bond cloudbr1 bond1 eth2 eth3 bond_mode=active-backup other_config:bond-detect-mode=miimon other_config:bond-miimon-interval=100
# ovs-vsctl add-br cloudbr100 cloudbr0 100

As for linux bridge all network configuration is applied in “/etc/network/interfaces”:

# vi /etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
iface eth0 inet manual
iface eth1 inet manual
iface eth2 inet manual
iface eth3 inet manual

auto cloudbr0
allow-ovs cloudbr0
iface cloudbr0 inet static
  address 192.168.100.20
  netmask 255.255.155.0
  gateway 192.168.100.100
  dns-nameserver 192.168.100.5
  ovs_type OVSBridge
  ovs_ports bond0

allow-cloudbr0 bond0 
iface bond0 inet manual 
  ovs_bridge cloudbr0 
  ovs_type OVSBond 
  ovs_bonds eth0 eth1 
  ovs_option bond_mode=active-backup other_config:miimon=100

auto cloudbr1
allow-ovs cloudbr1
iface cloudbr1 inet manual

allow-cloudbr1 bond1 
iface bond1 inet manual 
  ovs_bridge cloudbr1 
  ovs_type OVSBond 
  ovs_bonds eth2 eth3 
  ovs_option bond_mode=active-backup other_config:miimon=100

Network startup

Since Ubuntu 14.04 the bridges have started automatically without any requirement for additional startup scripts. Since OVS uses the same toolset across both CentOS and Ubuntu the same processes as described earlier in this blog post can be utilised.

# ovs-appctl bond/show bond0
# ovs-vsctl show

To ensure that only OVS bridges are used also check that linux bridge control returns no bridges:

# brctl show
bridge name	bridge id		STP enabled	interfaces

As mentioned earlier the following also needs added to the /etc/cloudstack/agent/agent.properties file:

network.bridge.type=openvswitch
libvirt.vif.driver=com.cloud.hypervisor.kvm.resource.OvsVifDriver

Internal bridge cloud0

In Ubuntu there is no requirement to add additional configuration for the internal cloud0 bridge, CloudStack manages this.

Optional tagged interface for storage traffic

Additional VLAN tagged interfaces are again accomplished by creating a VLAN tagged fake bridge on top of one of the cloud bridges. In this case we add it to cloudbr0 with VLAN 100 at the end of the interfaces file:

# ovs-vsctl add-br cloudbr100 cloudbr0 100
# vi /etc/network/interfaces
auto cloudbr100
allow-cloudbr0 cloudbr100
iface cloudbr100 inet static
  address 10.0.100.20
  netmask 255.255.255.0
  ovs_type OVSIntPort
  ovs_bridge cloudbr0
  ovs_options tag=100

Conclusion

As KVM is becoming more stable and mature, more people are going to start looking at using it rather that the more traditional XenServer or vSphere solutions, and we hope this article will assist in configuring host networking. As always we’re happy to receive feedback , so please get in touch with any comments, questions or suggestions.

About The Author

Dag Sonstebo is  a Cloud Architect at ShapeBlue, The Cloud Specialists. Dag spends most of his time designing, implementing and automating IaaS solutions based on Apache CloudStack.

Introduction

CloudStack 4.11.1 introduces a new security enhancement on top of the new CA framework to secure live KVM VM migrations. This feature allows live migration of guest VMs across KVM hosts using secured TLS enabled libvirtd process. Without this feature, the live migration of guest VMs across KVM hosts would use an unsecured TCP connection, which is prone to man-in-the-middle attacks leading to leakage of critical VM data (the VM state and memory). This feature brings stability and security enhancements for CloudStack and KVM users.

Overview

The initial implementation of the CA framework was limited to the provisioning of X509 certificates to secure the KVM/CPVM/SSVM agent(s)  and the CloudStack management server(s). With the new enhancement, the X509 certificates are now also used by the libvirtd process on the KVM host to secure live VM migration to another secured KVM host.

The migration URI used by two secured KVM hosts is qemu+tls:// as opposed to qemu+tcp:// that is used by an unsecured host. We’ve also enforced that live VM migration is allowed only between either two secured KVM hosts or two unsecured hosts, but not between KVM hosts with a different security configuration. Between two secured KVM hosts, the web of trust is established by the common root CA certificate that can validate the server certificate chain when live VM migration is initiated.

As part of the process of securing a KVM host the CA framework issues X509 certificates and provisions them to a KVM host and libvirtd is reconfigured to listen on the default TLS port of 16514 and use the same X509 certificates as used by thecloudstack-agent. In an existing environment, the admin will need to ensure that the default TLS port 16514 is not blocked however in a fresh environment suitable iptables rules and other configurations are done via cloudstack-setup-agent using a new '-s' flag.

Starting CloudStack 4.11.1, hosts that don’t have both cloudstack-agent and libvirtd processes secured and in Up state will show up in ‘Unsecure’ state in the UI (and in host details as part of listHosts API response):

This will allow admins to easily identify and secure hosts using a new ‘provision certificate’ button that can be used from the host’s details tab in the UI:

After a KVM host is successfully secured it will show up in the Up state:

As part of the onboarding and securing process, after securing all the KVM hosts the admin can also enforce authentication strictness of client X509 certificates by the CA framework, by setting the global setting ‘ca.plugin.root.auth.strictness' to true (this does not require restarting of the management server(s)).

About the author

Rohit Yadav is a Software Architect at ShapeBlue, the Cloud Specialists, and is a committer and PMC member of Apache CloudStack. Rohit spends most of his time designing and implementing features in Apache CloudStack.

Introduction

Last year we implemented a new CA Framework on CloudStack 4.11 to make communications between CloudStack management servers it’s hypervisor agents more secure. As part of that work, we introduced the ability for CloudStack agents to connect to multiple management servers, avoiding the usage of an external load balancer.

We’ve now extended the CA Framework by implementing load balance sorting algorithms which are applied to the list of management servers before being sent to the indirect agents.  This allows the CloudStack management servers to balance the agent load between themselves, with no reliance on an external load balancer. This will  be available in CloudStack 4.11.1 The new functionality also  introduces the notion of a preferred management server for agents, and a background mechanism to check and eventually connect to the preferred management server (assumed to be the first on the list the agent receives).

Overview

The CloudStack administrator is responsible for setting the list of management servers to connect to and an algorithm (to sort the management servers list) from the CloudStack management server using global configurations.

Management server perspective

This feature uses (and introduces) these configurations:

  • ‘indirect.agent.lb.algorithm’: The algorithm to be applied to the list of management servers on ‘host’ configuration before being sent to the agents. Allowed algorithm values are:
    • ‘static’: Each agent receives the same list as provided on ‘host’ configuration. Therefore, no load balancing performed.
    • ’roundrobin’: The agents are evenly spread across management servers
    • ‘shuffle’: Randomly sorts the list before being sent to each agent.
  • ‘indirect.agent.lb.check.interval’: The interval in seconds after which agent should check and try to connect to its preferred host.

Any changes to these global configurations are dynamic and do not require restarting the management server.

There are three cases in which new lists are propagated to the agents:

  • Addition of a host
  • Connection or reconnection of an agent
  • A change on the ‘host’ or ‘indirect.agent.lb.algorithm’ configurations

Agents perspective

Agents receive the list of management servers, the algorithm and the check interval (if provided) and persist them on their agent.properties file as:

hosts=<list-of-comma-separated-management-servers>@<algorithm>

The first management server on the list is considered the preferred host. The check interval to check for the preferred host should be greater than 0, in which case it is persisted on agent.properties on the ‘host.lb.check.interval’ key. In case the interval is greater than 0 and the host which the agent is connected to is not the preferred host, the agent will attempt connection to the preferred host

When connection is established between an agent and a management server, the agent sends its list of management servers. The management server checks if the list the agent has is up to date, sending the updated list if it is outdated. This behaviour ensures that each agent should get the updated version of the list of management servers even after any failure.

Examples

Assuming a test environment consisting on:

  • 3 management servers: M1, M2 and M3
  • 4 KVM hosts: H1, H2, H3 and H4

The ‘host’ global configuration should be set to ‘M1,M2,M3’

If the CloudStack administrator wishes no load balancing between agents and management servers, it would set the ‘static’ algorithm as the ‘indirect.agent.lb.algorithm’ global configuration. Each agent receives the same list (M1,M2,M3), and will be connected to the same management server.

If the CloudStack administrator wishes to balance connections between agents and management servers, the ’roundrobin’ algorithm is recommended. In this case:

  • H1 receives the list (M1, M2, M3)
  • H2 receives the list (M2, M3, M1)
  • H3 receives the list (M3, M1, M2)
  • H4 receives the list (M1, M2, M3)

There is also a ‘shuffle’ algorithm, in which the list is randomized before being sent to any agent. With this algorithm, the CloudStack administrator has no control of the load balancing so it is not recommened production use at the moment.

Combined with the algorithm, the CloudStack administrator can also set the ‘indirect.agent.lb.check.interval’ global configuration to ‘X’. This ensures that each X seconds, every agent will check if the management server they are connected to is the same as the first element of their list (preferred host). If there is a mismatch, the agent will attempt connecting to the preferred host.

About the author

Nicolas Vazquez is a Senior Software Engineer at ShapeBlue, the Cloud Specialists, and is a committer in the Apache CloudStack project. Nicolas spends his time designing and implementing features in Apache CloudStack.