Posts

Thursday, May 27 was the first-ever vCSEUG, and the first time the community had met since February 2020 (the last pre-COVID meetup in Berlin). It was time to reconnect! We have missed the chance to interact with other community members, learn what’s new in CloudStack and hear from the companies using it. So, we needed to find a solution to overcome all barriers. Organizing a virtual event was not only a great chance for the EU User Group members to meet but also to invite CloudStack community members and contributors from all around the globe to join in.

About the Attendees and Speakers

The vCSEUG proved to be a huge success. People joined from 23 countries and 4 continents (by my count) – from Germany, UK, Switzerland, India, Bulgaria, Greece, Poland, Serbia, Brazil, Chile, Russia, USA, Canada, Japan, France, Uruguay, Korea … (apologies if I have missed anyone)! We also had a record number of registrations and attendees for a CloudStack User Group Event. Physical distance was not an issue for our speakers, who joined the event from 6 different countries.

As usual, this was a half-day event, but we chose to have shorter talks so that we could accommodate a wider range of topics, proving that when you want to learn new things and meet the CloudStack community, nothing can stop you. During the day, we enjoyed 6 sessions from industry leaders, with conversation continuing in the ‘vPub’ after the last talk.

Without the need for sponsorship for a venue and refreshments, instead, we surprised some of the attendees with great prizes provided by the event sponsors – ShapeBlue, LINBIT, Dimsi, StorPool and PC Extreme – all of whom helped us to make the virtual event happen.

 

vCSEUG Proved to be a Huge Success

You may ask: “What is the secret behind making a virtual event happen?”. We think it is the great community we have and its dedication to doing something valuable and exciting after a long time of being not able to meet. This was evident by the number of attendees and the quality of talks, questions and ongoing collaboration.

If you missed the event or some of the talks, we are happy to share recordings as usual and will make all slides available on SlideShare. Continue reading and discover more about the event sessions and our awesome speakers.

 

vCSEUG Talks and Presentations

 

What’s New in CloudStack 4.15 – Giles Sirett

vCSEUG started with a talk from Giles Sirett, Chairman of the CSEUG and PMC member, Apache CloudStack. Giles welcomed the attendees and shared in-depth insights about the new features and functionalities in CloudStack 4.15. He also provided info on when 4.15.1 and 4.16 are expected, presented the new VP of Apache CloudStack (Gabriel Brascher), talked through the latest integrations to CloudStack, improvements in the UI, new OS support, advanced capabilities of vSphere, OVF support, dynamic roles enhancements and more. Giles talk in full here:

 

Customising the CloudStack UI – Abhishek Kumar

The next talk came from Abhishek Kumar, Software Developer at ShapeBlue, who focused on customizing the new UI. It aimed to teach administrators how to tailor the UI aesthetics to their organization’s preferences. The session targeted advanced users, training them to alter the layout of the UI, add, remove, or restrict resource actions from the UI. You can read more on how to customize the CloudStack UI on our blog and watch a full recording of Abhishesk’s session:

 

From Мetal to Service: 100% automation with Apache CloudStack and Ansible – Rafael del Valle

The next talk was presented by Rafael del Valle, Co-Founder of Celpax. His session was “From metal to service: 100% automation with Apache CloudStack and Ansible”. Celpax.com has recently deployed Apache CloudStack on Hetzner+Premises with full metal to service automation. In this talk, Rafael presented their success story. Furthermore, he shared why they chose open-source technologies and what advantages they got.

 

 

KVM High Availability Regardless of Storage – Gabriel Brascher

One of the most highly-anticipated talks was by the new Apache Cloudstack VP – Gabriel Beims Bräscher – talking about KVM High Availability Regardless of Storage. One of the great advantages of CloudStack is that it is vendor-independent, meaning you can decide the technology stack above and below based on your experience or needs. Having High Availability enabled for KVM hosts can improve greatly the QoS by handling (fence/recover) a problematic Host as well as re-starting its stopped VMs on healthy hosts. However, there is a limitation on CloudStack HA for KVM – it relies mainly on NFS heartbeat script checks. Gabriel’s talk illustrated how CloudStack HA works for KVM hosts and presented a way of improving its implementation in a way that KVM HA works with any storage system pluggable on KVM, not just NFS.

 

CloudStack and Tungsten Fabric SDN Integration – Simon Weller, Radu Todirica

After a short break, we continued with CloudStack and Tungsten Fabric SDN Integration Update, presented by Radu Todirica and Simon Weller from Education Networks of America (ENA). Over the past year, ENA and EWERK have been collaborating on a new ACS plugin for the Tungsten Fabric SDN controller. Simon Weller and Radu Todirica provided an update on progress, feature overview and a live demo.

 

CloudStack Deployments for Edge Use Cases – Rudraksh Kulshreshtha

The honour of the last talk of the day was given to Rudraksh Kulshreshtha from IndiQus, who presented how to architect lean CloudStack deployments for Edge use cases.

 

After the last talk, and the usual round of questions and answers, we headed over to the vPub where conversation, collaboration and debate continued. We would like to thank all attendees, sponsors and people engaged in the event who made it happen.

See you soon on another virtual or maybe live event …

 

Aside from traditional storage solutions, CloudStack has supported managed storage for some time. In this article, we will touch on SolidFire support in CloudStack 4.13 and lay out the exact steps needed to add SolidFire to CloudStack as Primary Storage (for VMware, KVM and XenServer). We will also explain the difference between the “SolidFire” and “SolidFireShared” plugins and discuss their use cases.

There will be a follow up article covering different feature sets that different hypervisors have when it comes to using SolidFire as Primary Storage, and we’ll also examine the way things work under the hood.

SolidFire 101

SolidFire has been around for many years, and the fact that it was acquired by NetApp (in early 2016) speaks for itself. SolidFire is an iSCSI-based, all-flash, distributed SAN solution, providing granular QoS on a per-LUN basis. A minimal cluster consists of 4 nodes, and newer generations of SolidFire models are able to provide 100,000 IOPS per single node. That means up to 400,000 IOPS per 4-node SolidFire cluster in just 4U of rack space (all IOPS figures assume a 4K IO size).

Different models of SolidFire nodes are available – currently 3 models (all with 100,000 IOPS per node). The differences between models are the size of SSDs and the amount of system memory / read cache. For more info on the different node models available, please visit https://www.netapp.com/us/products/storage-systems/all-flash-array/solidfire-scale-out.aspx.

Importantly, SolidFire supports mixing and matching nodes. So – if you are short on space, you can add bigger nodes to your cluster, whilst if you are short of IOPS, you can expand your cluster with smaller nodes. As a distributed SAN it has the advantage of being able to scale very well.

A great aspect of SolidFire is its granular, per-volume QoS. For each volume (LUN) created on the cluster, you can set its minimum, maximum and burst IOPS values / limits. Let’s briefly explain this:

  • Min IOPS: defines a guaranteed IOPS performance in normal conditions and in most failure / expansion scenarios. This means that having a dead SSD / node, or expanding the cluster with additional nodes (with data being redistributed) will not influence a client’s IOPS as the iSCSI client will always be able to reach its Min IOPS for a given volume.
  • Max IOPS: defines the maximum sustained IOPS performance for a volume. This means that if the client is (eg) benchmarking, the sustained IOPS numbers will be equal to the volume’s Max IOPS.
  • Burst IOPS: defines the allowed burst IOPS performance for a volume / LUN. This is very useful for VM reboots, DB backups and similar scenarios which require short IO bursts. A volume accrues 1 second of burst credit (up to a maximum of 60 seconds) for every second that the volume runs below its Max IOPS limit.

Regarding Max and Burst IOPS limits – they are just limits. It’s not guaranteed that the volume / LUN can achieve those numbers if your cluster is very busy. Those limits will be reached (when required by client / application) if the cluster has enough “unused” IOPS, as in the following example:

  • If your cluster has 400,000 IOPS of capacity but is only using 250,000, that leaves 150,000 to be consumed across the cluster, meaning that a single volume (if so configured) may theoretically achieve up to 150,000 IOPS

For volume QoS limits, it’s advisable to follow the user guide for the version of Element Software (formerly Element OS) that you are running on your nodes. Currently, those limits (Element Software v11.3) are as follows:

  • Min IOPS per volume: cannot exceed 15,000
  • Max IOPS per volume: cannot exceed 200,000

For other limits, please consult the Element Software User Guide.

SolidFire Plugins for CloudStack

There are 2 plugins: “SolidFire” and “SolidFireShared”.

SolidFire 1:1 plugin

The “SolidFire” plugin (referred to here as “SolidFire 1:1”) provides a 1:1 mapping between a CloudStack volume and a SolidFire volume (LUN), and the QoS you want for that specific CloudStack volume is configured on the SolidFire volume (LUN).

For each CloudStack volume created, it will do the following:

  • For VMware, create a dedicated VMware Datastore for each CloudStack volume.
  • For XenServer, create a dedicated XenServer SR for each CloudStack volume.
  • For KVM, create a “dedicated” iSCSI session for each CloudStack volume on a KVM host, effectively passing-through the iSCSI LUN (SolidFire volume) to a VM.

The main benefit of this plugin is that for each CloudStack volume you can set QoS as defined via Compute / Disk Offerings in the Storage QoS section. The plugin will take the “Min IOPS” and the “Max IOPS” setting (Burst IOPS is preconfigured as a multiplier of the maximum IOPS) – and will send those values to the SolidFire cluster’s API, so the values are set on the SolidFire volume / LUN. This way, CloudStack (via the plugin) manages the volumes on the SolidFire cluster, thus the name “Managed Storage”.

The downside of this plugin is that the number of Datastores (VMware) and SRs (XenServer) is limited to a relatively low value (native hypervisor limitations):

  • VMware 6.5 – maximum of 512 datastores per cluster (hard limit)
  • XenServer 6.x-8.0 – soft limit of 256 SRs (users have tested up to 500-600 SRs, but the time to mount new SRs becomes considerably higher with that many SRs as well as the time to reboot a host)
  • No particular limits for KVM

This means that for VMware and XenServer you cannot have more than ~500 volumes per cluster, but since volumes are stored on the datastore / SR, you can create VM snapshots. For KVM it’s not possible to create VM snapshots, since the iSCSI LUN is passed-through to the VM, so there is no QCOW2 file(s) in play – and KVM VM snapshots are only possible with QCOW2 files (i.e. not possible with any RAW block storage).

SolidFireShared plugin

The “SolidFireShared” plugin provides a many:1 mapping; ie. many CloudStack volumes on a single SolidFire volume, providing an alternative way to organize CloudStack volumes on SolidFire-based Primary Storage, and partially solving the scalability issues that exist when using the SolidFire 1:1 plugin (explained in the previous section). This plugin only supports VMware and XenServer.

Adding Primary Storage to CloudStack using the SolidFireShared plugin will result in the following:

  • For VMware, a new datastore being created immediately, formatted with VMFS5 and mounted on all ESXi hosts in the cluster.
  • For XenServer, a new SR being created immediately, using LVM (lvmoiscsi) and attached to all XenServers in a pool / cluster
  • All volumes will be placed on this shared LUN (datastore/SR).

With this plugin you can have a single datastore / SR for many CloudStack volumes and thus the number of volumes can be greater than the ~500 volumes with the SolidFire 1:1 plugin. However, with this setup, QoS is defined per whole datastore / SR, not per single CloudStack volume.

The SolidFireShared plugin requires that the Primary Storage be added as cluster-wide, i.e. zone-wide Primary Storage is not supported with this plugin (nor would it make much sense due to the native hypervisor limits).

As you can guess, you could do this setup manually (without using the SolidFireShared plugin), However, the plugin automates these steps making it less error-prone than the manual process.

If doing everything manually, the steps are as follows (the first 4 steps are done via the SolidFire UI or API):

  • create an Account (linked to your CloudStack installation).
  • create a list of allowed iSCSI initiators, which means all of your hosts in the specific cluster (get initiators IQN from your hypervisor hosts).
  • create a large enough SolidFire Volume with desired QoS.
  • create an Access Group, adding all previously created Initiators and the Volume to it.
  • Add an iSCSI-based Datastore / SR to Vmware / XenServer via vCenter / XenCenter.
  • Add new Primary Storage in CloudStack; for VMware use “VMFS” as a protocol and specify the previously created datastore name; for XenServer use PreSetup as a protocol and specify the previously created SR.

VMware setup

Before heading out to the CloudStack GUI and adding SolidFire / SolidFireShared-based Primary Storage, make sure that you:

  • Have an iSCSI Software adapter enabled on all ESXi hosts in the cluster.
  • Have done proper network binding of the iSCSI adapter to the correct vSwitch, so that your ESXi hosts will have an IP in the same VLAN as the SolidFire SVIP (Storage VIP).

Adding SolidFire 1:1-based Primary Storage

If adding zone-wide storage, set hypervisor=Any parameter (this is required for all hypervisor types).

CloudMonkey command to add zone-wide Primary Storage:


create StoragePool scope=zone zoneid=af61811f-3ca6-4927-ab0d-5bb6d693e3e7 hypervisor=Any name=SF121zonewide provider=SolidFire managed=true capacityBytes=107374182400 capacityIops=10000 url="MVIP=10.10.10.10;SVIP=10.254.10.10;clusterAdminUsername=admin;
clusterAdminPassword=password;clusterDefaultMinIops=1000;
clusterDefaultMaxIops=2000;clusterDefaultBurstIopsPercentOfMaxIops=2" tags=SF121ZONE

(NOTE: due to a very long URL parameter value, we have broken the URL value into multiple lines for readability – otherwise it should be a single line with no spaces)

For cluster-wide Primary Storage, syntax is slightly different:


create StoragePool scope=cluster zoneid=af61811f-3ca6-4927-ab0d-5bb6d693e3e7 podid=954065ed-a173-4c52-9f6f-062cd9b17ddb clusterid=72750371-a6ce-4d97-b567-1a9aefc416f8 name=SF121clusterwide provider=SolidFire managed=true capacityBytes=107374182400 capacityIops=10000 url="MVIP=10.10.10.10;SVIP=10.254.10.10;clusterAdminUsername=admin;
clusterAdminPassword=password;clusterDefaultMinIops=1000;
clusterDefaultMaxIops=2000;clusterDefaultBurstIopsPercentOfMaxIops=2;datacenter=Trillian" tags=SF121cluster

Most of the parameters are self-explanatory, but some do need an additional explanation:

  • The “capacityBytes” parameter is the logical / virtual size you want to deliver to CloudStack from the SolidFire cluster. The sum of the volumes, snapshots, and templates that reside on this Primary Storage cannot exceed “capacityBytes”. SolidFire performs compression and deduplication as well as leveraging thin provisioning, so the actual space used is usually much better than the sum of these virtual sizes.
  • “capacityIops”, in similar fashion, defines the total IOPS capacity that can be consumed on the SolidFire side (the sum of the Min IOPS (min_iops as visible in the cloud.volumes table in CloudStack DB)). The sum of all volumes created in CloudStack cannot exceed this value.
  • “MVIP” and “SVIP” (Management VIP and Storage VIP)
  • “clusterDefaultMinIops” and “clusterDefaultMaxIops” are default values that a CloudStack volume will get if there was no Min IOPS and Max IOPS values specified in the Compute / Disk offering.
  • “clusterDefaultBurstIopsPercentOfMaxIops” defines a Burst IOPS and is a decimal multiplier of the Max IOPS.
  • “datacenter” needs to point to the specific VMware datacenter; storage tags are optional.

We have shared the API calls in the above examples, but you can also use the GUI.

When creating Compute / Disk offerings, make sure to define “storage” as the type of QoS as shown on the image below. You can also set Min and Max IOPS for the volume here – these values will be taken from this offering and passed to the SolidFire API (via the plugin), so that the desired QoS is set on the SolidFire volume / LUN.

NOTE:  For both VMware and XenServer, you should set the “Hypervisor Snapshot Reserve” value (expressed as a percentage of the volume size). In the example above, for a 100GB volume, 200% is set (200GB) so the datastore (SolidFire volume / LUN) will be 300GB. If we didn’t set the value for this setting, the datastore would be created with the same size as the volume (100GB in this example) and taking VM snapshots would be impossible, since there would be no free space on the datastore. Since all SolidFire volumes are thinly provisioned, there is zero difference on the actual space consumption on the SolidFire cluster if the datastore is 100GB or 1TB, so make sure to take that into account.

Adding SolidFireShared-based Primary Storage

As previously mentioned, Primary Storage based on the SolidFireShared plugin can only be cluster-wide, so there are no variations regarding the scope parameter when it comes to the API call:


create StoragePool scope=cluster zoneid=af61811f-3ca6-4927-ab0d-5bb6d693e3e7 podid=954065ed-a173-4c52-9f6f-062cd9b17ddb clusterid=72750371-a6ce-4d97-b567-1a9aefc416f8 name=SFSHARED provider=SolidFireShared managed=false capacityBytes=107374182400 capacityIops=15000 url="MVIP=10.10.10.10;SVIP=10.254.10.10;clusterAdminUsername=admin;
clusterAdminPassword=password;minIops=15000;
maxIops=100000;burstIops=100000;datacenter=Trillian" tags=SFSHARED

Note the slightly different URL syntax than the one used with the SolidFire 1:1 plugin.

Some of the chosen parameters need an explanation:

  • “capacityBytes” is the size of the SolidFire volume / LUN, so make it a big number.
  • “capacityIops” needs to be the same value as the “minIops” (part of the “url” section), and as already mentioned, a single SolidFire volume cannot have more than 15,000 for its Min IOPS.
  • “maxIops” and “burstIops” may not exceed 100,000 IOPS, but you can later set these values up to the volume’s limits (200,000 IOPS currently) in the SolidFire UI.

When choosing the value for “capacityBytes” (which translates to the size of the datastore), make sure to consider any additional size needed for VM / volume snapshots.

Once you have added SolidFireShared-based Primary Storage, you’ll need to create Compute / Disk offerings as usual, but this time without defining QoS on the Storage level in the Compute / Disk offerings (as we are not managing QoS on SolidFire any further, beside setting it initially during the creation of the Primary Storage). Also, it’s not necessary to define the “Hypervisor Snapshot Reserve” value, since this parameter is only consumed by the SolidFire 1:1 plugin when creating a datastore for each volume. These settings apply for both VMware and XenServer.

XenServer setup

Before trying to add SolidFire to CloudStack, make sure that you have configured your XenServers’ networks in such a way that they can access the SVIP of the SolidFire cluster. That usually means creating an additional network on the storage VLAN and creating an IP address on that network.

Once your XenServer hosts can communicate with the SVIP of the SolidFire cluster, you are ready to add a new SolidFire Primary Storage.

Adding SolidFire 1:1-based Primary Storage

As already stated in the VMware setup guide, make sure to set hypervisor=Any parameter in your API call for creating a zone-wide Primary Storage. The syntax is pretty much the same as for VMware.

CloudMonkey command to add zone-wide Primary Storage:


create StoragePool scope=zone zoneid=d2e2da70-204c-42b3-84d1-07917a2383a7 hypervisor=Any name=SF121zonewide provider=SolidFire managed=true capacityBytes=107374182400 capacityIops=10000 url="MVIP=10.10.10.10;SVIP=10.254.10.10;clusterAdminUsername=admin;
clusterAdminPassword=password;clusterDefaultMinIops=1000;
clusterDefaultMaxIops=2000;clusterDefaultBurstIopsPercentOfMaxIops=2" tags=SF121ZONE

For cluster-wide Primary Storage, the syntax is slightly different – the difference to the VMware setup is the absence of the “datacenter” parameter in the URL:


create StoragePool scope=cluster zoneid=d2e2da70-204c-42b3-84d1-07917a2383a7 podid=711b8d51-8f67-4b89-8e68-7d7a28a013b0 clusterid=b98afe80-9614-48b3-aba1-b79624086bb9 name=SF121clusterwide provider=SolidFire managed=true capacityBytes=107374182400 capacityIops=10000 url="MVIP=10.10.10.10;SVIP=10.254.10.10;clusterAdminUsername=admin;
clusterAdminPassword=password;clusterDefaultMinIops=1000;
clusterDefaultMaxIops=2000;clusterDefaultBurstIopsPercentOfMaxIops=2" tags=SF121cluster

If some of the parameters used in the API call are unclear, please check the VMware setup guide above, where you’ll find detailed explanations for each parameter.

For the Compute / Disk offering parameters that are needed specifically when using the SolidFire 1:1 plugin, please also see the corresponding section in the VMware setup guide – same “rules” apply here.

Adding SolidFireShared-based Primary Storage

Again, Primary Storage based on the SolidFireShared plugin can only be cluster-wide, so there are no variations when it comes to the scope parameter of the API call – same syntax as with VMware, we just skip the “datacenter” parameter in the URL:


create StoragePool scope=cluster zoneid=d2e2da70-204c-42b3-84d1-07917a2383a7 podid=711b8d51-8f67-4b89-8e68-7d7a28a013b0 clusterid=b98afe80-9614-48b3-aba1-b79624086bb9 name=SFSHARED provider=SolidFireShared managed=false capacityBytes=107374182400 capacityIops=15000 url="MVIP=10.10.10.10;SVIP=10.254.10.10;clusterAdminUsername=admin;
clusterAdminPassword=password;minIops=15000;
maxIops=100000;burstIops=100000" tags=SFSHARED

In regard to the explanation of important parameters as well as different notes on the Compute/Disk offerings, please see the VMware setup for the SolidFireShared plugin above, which explains those in detail.

KVM setup

KVM, being a pretty much “unmanaged” hypervisor, is a bit different in terms of what you can do with it, and it’s much easier to make low-level changes as required. In that sense, the SolidFire 1:1 plugin works perfectly well and thus there is no need for SolidFireShared plugin support – though you can always do the big-shared-iSCSI-LUN-with-clustered-(God-forbid)-file-system yourself if you really want to.

Adding SolidFire 1:1-based Primary Storage

Before trying to add SolidFire-based Primary Storage, make sure to do the following:

  • Attach the proper storage VLAN with an IP address to all KVM hosts, so that the SolidFire SVIP is reachable.
  • Install an iSCSI initiator on all KVM hosts with yum install iscsi-initiator-utils or apt-get install open-iscsi. This will create the following file: /etc/iscsi/initiatorname.iscsi, which contains the IQN of the host (that will later be added to the cloud.host table, “url” field – this happens with all hypervisors).

You can set up both zone-wide and cluster-wide Primary Storage, as in the case of VMware and XenServer. The parameter “hypervisor” should still be set to “Any” (though the plugin will not complain if you set “hypervisor=KVM”, but will still set it to “Any” internally in the database):

CloudMonkey command to create zone-wide Primary Storage:


create StoragePool scope=zone zoneid=06938de4-0a5b-46f9-bbe7-5a264f43d4eb hypervisor=Any name=SF121zonewide provider=SolidFire managed=true capacityBytes=107374182400 capacityIops=10000 url="MVIP=10.10.10.10;SVIP=10.254.10.10;clusterAdminUsername=admin;
clusterAdminPassword=password;clusterDefaultMinIops=1000;
clusterDefaultMaxIops=2000;clusterDefaultBurstIopsPercentOfMaxIops=2" tags=SF121ZONE

Again, the syntax for cluster-wide Primary Storage is slightly different – but otherwise identical to XenServer syntax:


create StoragePool scope=cluster zoneid=06938de4-0a5b-46f9-bbe7-5a264f43d4eb podid=62717320-3fc4-4c53-9345-c53eba516710 clusterid=79064c12-659a-4886-8c4d-5ee38c842a0f name=SF121clusterwide provider=SolidFire managed=true capacityBytes=107374182400 capacityIops=10000 url="MVIP=10.10.10.10;SVIP=10.254.10.10;clusterAdminUsername=admin;
clusterAdminPassword=password;clusterDefaultMinIops=1000;
clusterDefaultMaxIops=2000;clusterDefaultBurstIopsPercentOfMaxIops=2" tags=SF121cluster

If some of the parameters used in the API call are unclear, please check the VMware setup guide, where you’ll find detailed explanations for each important parameter.

For the Compute / Disk offering parameters, it’s still required to set Min and Max IOPS as the Storage Quality Of Service parameters – but there is no need to define “Hypervisor Snapshot Reserve”, since there is no datastore / SR with KVM, and VM snapshots are not supported, so nothing to reserve space for.

This concludes this part of the SolidFire article series. In the next part, we cover different feature sets that different hypervisors have when it comes to using SolidFire as Primary Storage, and we’ll also examine the way things work under the hood.

About the author

Andrija Panic is a Cloud Architect at ShapeBlue, the Cloud Specialists, and is a committer and PMC member of Apache CloudStack. Andrija spends most of his time designing and implementing IaaS solutions based on Apache CloudStack.
We would like to thank Mike Tutkowski , Senior Software Developer at SolidFire who implemented the SolidFire plugin in CloudStack, for his review and help with this article.

 

Introduction

In the previous two parts of this article series, we have covered the complete Ceph installation process and implemented Ceph as an additional Primary Storage in CloudStack. In this final part, I will show you some examples of working with RBD images, and will cover some Ceph specifics, both in general and related to the CloudStack.

RBD image manipulations

In case you need to do some low-level client support, you can even try to mount that image as the local disk on any KVM (or Ceph) node. For this purpose, we use a tool, conveniently named, “rbd”, which is used to operate RBD images in general (i.e. to create new images, snapshots, clones, delete images, etc.).

From any KVM node, let’s attempt to use rbd kernel client to mount an image:

[root@kvm1 ~]# rbd map cloudstack/ad9a6725-4a65-4c8b-b60a-843ed88618be
rbd: sysfs write failed
RBD image feature set mismatch. Try disabling features unsupported by the kernel with "rbd feature disable".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address

As you can, see the above map command will fail. This happens as we are running on default kernel 3.x on the latest CentOS 7.6 and some of the new Ceph image features are not supported by the kernel client. Actually, even upgrading kernel to 5.0 will not bring Ceph kernel client to a usable state for our Mimic (or even Luminous) cluster – so one could wonder why kernel client exists at all….

So what we should do is to use rbd-nbd tool to map an image. Rbd-nbd is a client for RADOS block device (RBD) images similar to rbd kernel module, but unlike the rbd kernel module (which communicates with Ceph cluster directly), rbd-nbd uses NBD (generic block driver in kernel) to convert read/write requests to proper commands that are sent through network using librbd (user space client).

So, as stated, we are using librbd which is always on par with the cluster capabilities/features. But here the fun begins – the NBD kernel driver is not available by default with CentOS 7 (Red Hat decided to not include it with their kernel), so you have a couple of options: either you can rebuild the specific kernel version from the official kernel sources and extract the NBD kernel module or you can upgrade a kernel to the one provided by a well-known Elrepo repository, which I find simpler and easier to manage in the long run. For those of you possibly not familiar with Elrepo repository, this is a community repository, providing many different kinds of additional packages for Centos/RHEL, including fresh kernel versions – Elrepo kernel packages are built from the official sources while the kernel configuration is based upon the default RHEL configuration with added functionality enabled as appropriate.

A simple way to move to Elrepo kernel would be as following:

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
yum install https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
yum --enablerepo=elrepo-kernel install kernel-lt

If the above URLs become invalid, please find the new one at https://elrepo.org/tiki/tiki-index.php.

Now that we’ve got the new kernel installed, let’s check the order of kernels offered for boot:

[root@kvm1 ~]# awk -F\' /^menuentry/{print\$2} /etc/grub2.cfg
CentOS Linux (4.4.178-1.el7.elrepo.x86_64) 7 (Core)
CentOS Linux (3.10.0-957.10.1.el7.x86_64) 7 (Core)
CentOS Linux (3.10.0-693.11.1.el7.x86_64) 7 (Core)
CentOS Linux (3.10.0-327.el7.x86_64) 7 (Core)
CentOS Linux (0-rescue-76b17df4966743ce9a20fe9a7098e2b6) 7 (Core)

Above we see our new kernel (version 4.4) on position 0, so let’s set it as the default one, and reboot:

grub2-set-default 0
reboot

Finally, with the NBD driver in place (as part of new kernel), let’s install rbd-nbd and mount our image:

[root@kvm1 ~]#  yum install rbd-nbd -y 
[root@kvm1 ~]#  rbd-nbd map cloudstack/ad9a6725-4a65-4c8b-b60a-843ed88618be
/dev/nbd0

The above map command completed successfully (make sure that the image/volume is not mounted elsewhere to avoid file system corruption) and we can now operate /dev/nbd0 as any other locally attached drive, i.e.:

[root@kvm1 Ceph]# fdisk -l /dev/nbd0
 
Disk /dev/nbd0: 5368 MB, 5368709120 bytes, 10485760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4194304 bytes / 4194304 bytes
Disk label type: dos
Disk identifier: 0x264c895d
 
     Device Boot      Start         End      Blocks   Id  System
/dev/nbd0p1            8192    10485759     5238784   83  Linux

You can also show any other mapped images (if any):

[root@kvm1 Ceph]# rbd-nbd list-mapped 
id    pool       image                                snap device
11371 cloudstack ad9a6725-4a65-4c8b-b60a-843ed88618be -    /dev/nbd0

When done with this, unmap the RBD image with:

[root@kvm1 Ceph]# rbd-nbd unmap /dev/nbd0 

Alternatively, just to make the above exercise more complete, we can also attach the RBD image to our host with qemu-nbd (as you can guess this also requires NBD kernel module, a.k.a. newer kernel):

qemu-nbd --connect=/dev/nbd0 rbd:cloudstack/ad9a6725-4a65-4c8b-b60a-843ed88618be
qemu-nbd --disconnect /dev/nbd0

Again, the above tool talks to librbd, which relies on ceph.conf and admin key being present in /etc/ceph/ folder.

RBD image manipulations, for real this time

Away from the NBD magic and back to “rbd” tool – let’s briefly show some usage examples to manipulate RBD images.

At this point, I would expect you to be able to create a Compute Offering of your own, targeting Ceph as Primary Storage pool (similarly to how we created a Disk offering with “RBD” tag) and create a VM from a template. Assuming you have done so, let’s examine this VM’s ROOT image on Ceph.

Let’s list all volumes in our Ceph cluster:

[root@ceph1 ~]# rbd -p cloudstack ls
d9a1586d-a30b-4c52-99cc-c5ee6433fe18
fb3ee723-5e4e-48b3-ad7d-936162656cb4

In my example above (an empty cluster), I have only 2 images present, so let’s examine them:

[root@ceph1 ~]# rbd info cloudstack/d9a1586d-a30b-4c52-99cc-c5ee6433fe18
rbd image 'd9a1586d-a30b-4c52-99cc-c5ee6433fe18':
        size 50 MiB in 13 objects
        order 22 (4 MiB objects)
        id: 1b1d26b8b4567
        block_name_prefix: rbd_data.1b1d26b8b4567
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        op_features:
        flags:
        create_timestamp: Mon Apr  8 21:40:21 2019

[root@ceph1 ~]# rbd info cloudstack/fb3ee723-5e4e-48b3-ad7d-936162656cb4
rbd image 'fb3ee723-5e4e-48b3-ad7d-936162656cb4':
        size 50 MiB in 13 objects
        order 22 (4 MiB objects)
        id: 1b250327b23c6
        block_name_prefix: rbd_data.1b250327b23c6
        format: 2
        features: layering
        op_features:
        flags:
        create_timestamp: Mon Apr  8 21:50:10 2019
        parent: cloudstack/d9a1586d-a30b-4c52-99cc-c5ee6433fe18@cloudstack-base-snap

What we see above is the first image of 50 MB in size (here I’m using a very small template from our community friends at http://www.openvm.eu/). For the second image, we see that it has it’s “parent” set to the first image. What is happening here is that CloudStack will copy over the template from Secondary Storage (creating image d9a1586d-a30b-4c52-99cc-c5ee6433fe18), it will then create a snapshots from this image (“cloudstack-base-snap”, as shown above) and protect it (required from Ceph side), and then create an image clone ( image fb3ee723-5e4e-48b3-ad7d-936162656cb4) with a parent image being the previously created/protected snapshot. This is effectively a linked clone setup, in Ceph’s world.

Let’s quickly emulate above behavior manually – we will just create an empty file instead of copying a real template from Secondary Storage pool:

rbd create -p cloudstack mytemplate --size 100GB
rbd snap create cloudstack/mytemplate@cloudstack-base-snap
rbd snap protect cloudstack/mytemplate@cloudstack-base-snap
rbd clone cloudstack/mytemplate@cloudstack-base-snap cloudstack/myVMvolume

Finally, let’s check our “myVMvolume” image:

[root@ceph1 ~]# rbd info cloudstack/myVMvolume
rbd image 'myVMvolume':
        size 100 GiB in 25600 objects
        order 22 (4 MiB objects)
        id: fcba6b8b4567
        block_name_prefix: rbd_data.fcba6b8b4567
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        op_features:
        flags:
        create_timestamp: Mon Apr  8 22:08:00 2019
        parent: cloudstack/mytemplate@cloudstack-base-snap
        overlap: 100 GiB

The reason we have above protected a snapshots of the main template, is that it makes it impossible to delete this template file and thus cause major damage (effectively destroy all VMs from this template). So in order to revert the exercise from above, we would need to first remove a clone (VM image), unprotect and delete the snapshot, and finally delete the “template” image:

[root@ceph1 ~]# rbd rm cloudstack/myVMvolume
Removing image: 100% complete...done.
[root@ceph1 ~]# rbd snap unprotect cloudstack/mytemplate@cloudstack-base-snap
[root@ceph1 ~]# rbd snap rm cloudstack/mytemplate@cloudstack-base-snap
Removing snap: 100% complete...done.
[root@ceph1 ~]# rbd rm cloudstack/mytemplate
Removing image: 100% complete...done.

As you can see from above, the “rbd” tool is the tool to use to manipulate RBD images – for many other examples, please see http://docs.ceph.com/docs/master/rbd/rados-rbd-cmds/

In the next sections, I will try to cover a bit of what to expect, dos and don’ts of using Ceph cluster – again, this is not meant to be a comprehensive guide, since CloudStack storage feature sets related to Ceph and also Ceph itself is constantly changing and improving.

Cloudstack with Ceph

CloudStack has supported Ceph for many years now and though in the early days, many of the CloudStack’s storage features were not on pair with NFS, with the wider adoption of Ceph as the Primary Storage solution for CloudStack, most of the missing features have been implemented. That being said, there are still some minor specifics with Ceph:

  • In contrast to NFS, which provides a file system, Ceph provides raw block devices for VMs and thus it’s not possible to make a full VM snapshot the way it can be done with NFS – i.e. there is no filesystem to which to write the content of the RAM memory or other metadata needed.
  • Continuing on above, since no file system, there is (currently) no way to write a heartbeat file the way it’s implemented on NFS (but some work is currently being planned around this)
  • Speaking of a simple volume snapshot, it’s currently not possible to really restore a volume from a snapshot (which became possible in 4.11 with NFS or SolidFire Managed Storage). But this is rather a simple thing to implement.
  • By using Ceph with KVM and CloudStack, there are two external libraries used – librbd (a user-space client, used by Qemu to speak to Ceph cluster) and rados-java (a Java wrapper for librados). Historically, there have been some issues/bugs with rados-java (though they have been resolved long time ago), so you should probably keep an eye on these two and make sure they are up to date.

Learning curve

Ceph can be considered a rather complex storage system to comprehend and it definitively has a steep learning curve. Even when using Ceph on its own (i.e. not with CloudStack), make sure you know the storage system well before relying on it in production and also make sure to be able to troubleshoot problematic situations when they arise. Anyone prepared to put in a bit of effort and planning can deploy a nice Ceph cluster, but it takes skills and a deeper understanding of how things work under the hood when it comes to troubleshooting unusual situations or how (for example) replacing failed disks (or nodes) can influence performance of the clients (in our case, CloudStack VMs). Based on my previous experience managing a production Ceph cluster, I would definitively strongly suggest taking the above recommendations very seriously. On the other hand, experimenting with such an advanced storage solution can be really interesting and rewarding.

Performance considerations

Ceph is a great distributed storage solution that performs very well under sequential IO workload, can scale indefinitely and has a very interesting architecture. However, as with any distributed storage solution, when you write data to a cluster, it actually takes time to write that data to first node, have that write operation replicated to other 2 nodes (replica size 3) and send back ACK to the client that the write is successful. In case you are using NVME devices, like some users in community, you can expect very low latencies in the ballpark of around 0.5ms, but in case you opt out for a HDD based solution (with journals on SSDs) as many users do when starting their Ceph journey, you can expect 10-30ms of latency, depending on many factors (cluster size, networking latency, SSD/journal latency, and so on). In other words, don’t expect miracles with commodity hardware – if something “works on commodity hardware”, that doesn’t mean it also performs well. Actually, until 2-3 years ago, Ceph was (unofficially) considered unsuitable for any more serious random IO workload, which is backed up by referenced guides published between major HW vendors and RedHat, where they clearly stated that Ceph is not the “best choice” when it comes to random IO. In contrast to sequential IO workload benchmarks (where Ceph actually shines), these reference guides were completely missing any benchmark data with random IO workload/pattern. That being said, with the introduction of BlueStore as the new storage backend in recent Ceph releases and with some more architectural changes, Ceph has become much more suitable for pure SSD (and NVME) clusters and performance has improved drastically.

I hope this Ceph article series has been useful and interesting and helped you get up and running. All in all Ceph is an interesting development in the storage space and can provide a true cloud storage solution if implemented correctly.

About the author

Andrija Panic is a Cloud Architect at ShapeBlue, the Cloud Specialists, and is a committer of Apache CloudStack. Andrija spends most of his time designing and implementing IaaS solutions based on Apache CloudStack.

In the previous article we covered some basics around Ceph and deployed a working Ceph cluster. In this article, we are going to finish the Ceph configuration needed for CloudStack and add it as a new Primary Storage pool. We are also going to deploy Ceph volumes via CloudStack and examine them. Finally, in part 3 (to be published soon), I will show you some examples of working with RBD images and will cover some Ceph specifics, both in general and related to the CloudStack.

Before proceeding with the actual work, let me first mention that CloudStack supports Ceph with KVM only, so most of the work we do below is KVM related. Let’s define the high-level steps to be done:

  • Create a dedicated RBD pool for CloudStack in which all RBD images (volumes) will be created
  • Create a dedicated authentication key for the previously created pool
  • Update / install required Ceph binaries on KVM nodes
  • Add Ceph as Primary Storage in CloudStack
  • Implement custom storage tag for Ceph Primary Storage
  • Create new Compute / Disk offerings with same storage tag in order to target Ceph

From any Ceph node…

Ceph groups RBD (RADOS block device) images in pools and manages authentication on a per pool level. Each image is collection of many RADOS objects, with each object having a default size of 4MB (configurable per image). At this moment we have no pools created. But before creating a pool, let’s go through some basics around the different kind of pools in Ceph.

There are 2 kind of pools, based on the way the objects are stored across cluster:

  • Replicated – makes sure that there are always total of N replicas/copies of an object
  • Erasure Coding – simplest way to think of this is a network RAID 5/6

Replicated pools are used for better performance at the expense of space consumption, and you can think of it as a network-based RAID 1, where we have n number of replicas of an object. On the other hand, erasure coding pools are usually used when using Ceph for S3 Object Storage purposes and for more space efficient storage where bigger latency and lower performance is acceptable, since it is similar to RAID 5 or RAID 6 (requires some computation power). Here, for example, we may have 4 chunks of actual data and 2 parity chunks (EC 4+2), with just 50% of space overhead, while (depending on the setup), we can still survive losing a Ceph node or even two.

So, let’s create a dedicated pool for CloudStack, set its replica size and finally initialize it:

ceph osd pool create cloudstack 64 replicated
ceph osd pool set cloudstack size 3
rbd pool init cloudstack

The commands above will create a replicated pool named “cloudstack” with total of 64 placement groups (more info on placement groups here) with a replica size of 3, which is recommended for a production cluster. Optionally, you can set replica size of 2 during testing, for somewhat increased performance and less space consumed on the cluster.

Next, let’s generate a dedicated authentication key for our CloudStack pool:

ceph auth get-or-create client.cloudstack mon 'profile rbd' osd 'profile rbd pool=cloudstack'

The command above will output a key to STDOUT only – please save the given key, since we will use it when adding Ceph to CloudStack later:

[client.cloudstack]
key = AQAFSZpc0t+BIBAAO95rOl+jgRwuOopojEtr/g==

Now that the pool for CloudStack is ready, we need to prepare KVM nodes with proper Ceph binaries as well as the write-back caching configuration.

From the Ceph admin node…

Starting from Centos 7.2  (and Ubuntu 14.04), libvirt / QEMU comes by default with support for RBD, so there’s no need to compile the binaries yourself. That being said, if we check KVM nodes with “rpm -qa | grep librbd1″ it will return an existing versions of ‘librbd1” package (version 10.2.5 in my case)  already installed, but most certainly it will not be the current version that corresponds to the cluster version we just installed (13.2.5 in this case). For the record, librbd is a user space Ceph client, to which the qemu / libvirt talks effectively.

Furthermore, if we run command “ceph features” from any Ceph node, it will return (in our fresh Mimic cluster) “luminous” as the minimum compatible release version for the client – that means that our Ceph client (librbd) needs to be of a minimum of “luminous” version (which translates to 12.2.0), but our current librbd version is 10.2.5 – so let’s upgrade it to same Mimic versions as the version of our cluster:

ceph-deploy install --cli kvm1  kvm2

The command above will add Mimic repo to my two KVM nodes and install only the cli binaries (“ceph-common” package). This will also trigger the upgrade of existing “librbd1” package to the correct version. In addition, please make sure that name resolution of the KVM nodes works from the Ceph admin node.

Optionally, if you don’t want to install Ceph cli tools on KVM nodes, you can just upgrade the “librbd1” package while having previously created a proper Ceph Mimic repository on each KVM node (i.e. clone repo file from any Ceph cluster node).

Some of you might want to be able to manage Ceph cluster from KVM nodes as well (beside being able to manage it from Ceph nodes) and to be able to interact with RBD images with via “rbd” or “qemu-img“ tools – in this case we need this “rbd” tool installed on KVM nodes (part of “ceph-common” package, already installed in previous step), then we need ceph.conf locally on KVM nodes in order for the “rbd” tool to know how to connect to cluster, which MONs to target, etc. and finally we need the admin authentication key – this is the file “ceph.client.admin.keyring” which was created on our Ceph admin node when we created our cluster initially (in folder /root/CEPH-CLUSTER, as mentioned in Part 1 of this article series).

Additionally, if we want to use qemu-img tool to examine RBD images, we can either have qemu-img installed on the Ceph cluster nodes or we have to provide the above mentioned ceph.conf and admin keys in their default location (/etc/ceph/) on the KVM nodes, where librbd (client) will pick them up automatically, so we don’t need to specify MON IP/URL and admin key on the command line.

If you don’t want to be able to manage your Ceph cluster from KVM nodes, simply don’t copy over the “ceph.client.admin.keyring” file to KVM nodes. The ceph.conf file is still a must due to RBD caching as explained later. I have decided to make my KVM nodes happy by providing them with ceph.conf and admin keys, as below:

ceph-deploy admin kvm1 kvm2

The command above will effectively just copy ceph.conf and ceph.client.admin.keyring files to /etc/ceph/ folder on KVM nodes. Actually, you can still operate RBD images and manage your cluster from KVM nodes even if you don’t have ceph.conf and admin key present locally – you can always pass required parameters on the command line to “rbd” or “qemu-img” tools, as shown later.

RBD caching

After we have pushed the ceph.conf file to KVM nodes, librbd will read it for any configuration directives under the “[client]” section of that file (beside the other sections), but that section is missing at this moment!

Before we proceed into configuring the RBD caching, let me do here a copy/paste from the original docs that is important to understand regarding RBD caching:

” The user space implementation of the Ceph block device (i.e., librbd) cannot take advantage of the Linux page cache, so it includes its own in-memory caching, called “RBD caching.” RBD caching behaves just like well-behaved hard disk caching. When the OS sends a barrier or a flush request, all dirty data is written to the OSDs. This means that using write-back caching is just as safe as using a well-behaved physical hard disk with a VM that properly sends flushes (i.e. Linux kernel >= 2.6.32). The cache uses a Least Recently Used (LRU) algorithm, and in write-back mode it can coalesce contiguous requests for better throughput. “

After digesting the above info, we can proceed into a brief configuration of caching. We can either fix it manually on each KVM node by adding the missing section in ceph.conf file, or we can do it in a more proper way by changing ceph.conf on the Ceph admin node and then pushing new file version to all KVM (and optionally Ceph cluster) nodes:

cat << EOM >> /root/CEPH-CLUSTER/ceph.conf
[client]
  rbd cache = true
  rbd cache writethrough until flush = true
EOM
 
ceph-deploy --overwrite-conf admin kvm1 kvm2

Please note the above “writethrough until flush = true”. This is a safety mechanism which will force writethrough cache mode until it receives the very first flush request from the VM OS (which means that the OS is sending proper flush requests to the underlying storage, i.e. kernel >= 2.6.32) and then cache mode will change to the write-back, which actually brings performance benefits.

If case you want to play more with RBD caching, please see here – where you can find some important default values which we didn’t explicitly configure i.e. default rbd cache size is 32 MB (this is per volume) – so in case of 50 VMs with 4 volumes each, that translates to 50 x 4 x 32MB =  6.4GB of additional RAM consumed on a KVM host – keep that in mind !

Finally, let’s add the Ceph to CloudStack as an additional Primary Storage – we can do it via GUI or optionally via CloudMonkey (API) as following:

 

Or via CloudMonkey:

create storagepool scope=zone zoneid=3c764ee1-6590-417d-b873-f073d0c550be hypervisor=KVM name=MyCephCluster provider=Defaultprimary url=rbd://cloudstack:AQAFSZpc0t-BIBAAO95rOl+jgRwuOopojEtr_g==@10.2.2.219/cloudstack tags=RBD

Most of the parameters are self-explanatory but let’s explain a few of them:

  • RADOS Monitor: This is the IP address (or DNS name) of the Ceph Monitor (MON) instance – in my case I have defined a very first MON instance (IP address of the Ceph1 node from my cluster) – but in production environment you will want to have an internal Round Robin DNS setup on some internal DNS server (i.e. single zone on Bind) – such that KVM nodes will resolve the ULR (i.e. mon.myceph.cluster) in a round robin fashion to multiple MON instances – this is the way to achieve high availability of Ceph MONs, though some manual DNS zone changes are needed in case of prolonged MON maintenance
  • RADOS Pool: This is the pool “cloudstack” which we created in the beginning of the article
  • RADOS User and RADOS Secret: This are the values from the authentication key which we generated in the beginning of the article, shown below again for your convenience

[client.cloudstack]
key = AQAFSZpc0t+BIBAAO95rOl+jgRwuOopojEtr/g==

The above command, used to add Ceph to CloudStack, will effectively do a few things:

  • On each KVM node, it will create a new storage pool in libvirt
  • The storage pool definition files (xml and the secret) will be written to /etc/libvirt/secrets/ folder as shown below
  • Every time CloudStack Agent is restarted, it will recreate the Ceph storage pool (even if you manually remove the files below)

[root@kvm1]# cat /etc/libvirt/secrets/ef9cfd17-abe1-343d-97a0-cee6c71a6dad.xml
<secret ephemeral='no' private='no'>
  <uuid>ef9cfd17-abe1-343d-97a0-cee6c71a6dad</uuid>
  <usage type='ceph'>
    <name>cloudstack@ceph1.local:6789/cloudstack</name>
  </usage>
</secret>

[root@kvm1]# cat /etc/libvirt/secrets/ef9cfd17-abe1-343d-97a0-cee6c71a6dad.base64
AQAFSZpc0t+BIBAAO95rOl+jgRwuOopojEtr/g==

If we check the libvirt pool created above, we can see that it’s not persistent and it doesn’t start automatically – i.e. when you restart libvirt alone, it will not recreate / start the Ceph storage pool in libvirt– the CloudStack agent is the one doing this for us:

virsh # pool-info ef9cfd17-abe1-343d-97a0-cee6c71a6dad
Name:           ef9cfd17-abe1-343d-97a0-cee6c71a6dad
UUID:           ef9cfd17-abe1-343d-97a0-cee6c71a6dad
State:          running
Persistent:     no
Autostart:      no
Capacity:       299.99 GiB
Allocation:     68.19 MiB
Available:      286.02 GiB

Note that in the example above, I was actually using DNS name for the Ceph MON (ceph1.local) instead of the IP – Ceph MON’s DNS name is resolved to IP both when you add Ceph to CloudStack and every time you start a VM or attach new volume, etc. – so DNS resolution needs to be fast and stable here.

Now that we added Ceph to CloudStack, let’s create a Data disk offering with tag “RBD” – this will make sure that any new volume from this offering is created on storage pool with tag “RBD” – which is Ceph in our case . Here, we are using storage tags to avoid messing up with your existing CloudStack installation – but it’s not required otherwise:

(localcloud) SBCM5> > create diskoffering name=5GB-Ceph displaytext=5GB-Ceph storagetype=shared provisioningtype=thin customized=false disksize=5 tags=RBD
{
  "diskoffering": {
    "created": "2019-03-26T19:27:32+0000",
    "disksize": 5,
    "displayoffering": true,
    "displaytext": "5GB-Ceph",
    "id": "2c74becc-c39d-4aa8-beec-195b351bdaf0",
    "iscustomized": false,
    "name": "5GB-Ceph",
    "provisioningtype": "thin",
    "storagetype": "shared",
    "tags": "RBD"
  }
}

Note the offering ID from above (2c74becc-c39d-4aa8-beec-195b351bdaf0) – and let’s create a disk from it:

(localcloud) SBCM5> > create volume diskofferingid=2c74becc-c39d-4aa8-beec-195b351bdaf0 name=MyFirstCephDisk zoneid=3c764ee1-6590-417d-b873-f073d0c550be
{
  "volume": {
    "account": "admin",
    "created": "2019-03-26T19:52:05+0000",
    "destroyed": false,
    "diskofferingdisplaytext": "5GB-Ceph",
    "diskofferingid": "2c74becc-c39d-4aa8-beec-195b351bdaf0",
    "diskofferingname": "5GB-Ceph",
    "displayvolume": true,
    "domain": "ROOT",
    "domainid": "401ce404-44c1-11e9-96c5-1e009001076e",
    "hypervisor": "None",
    "id": "47b1cfe5-6bab-4506-87b6-d85b77d9b69c",
    "isextractable": true,
    "jobid": "49a682ab-42f9-4974-8e42-452a13c97553",
    "jobstatus": 0,
    "name": "MyFirstCephDisk",
    "provisioningtype": "thin",
    "quiescevm": false,
    "size": 5368709120,
    "state": "Allocated",
    "storagetype": "shared",
    "tags": [],
    "type": "DATADISK",
    "zoneid": "3c764ee1-6590-417d-b873-f073d0c550be",
    "zonename": "ref-trl-1019-k-M7-apanic"
  }
}

Finally, since volume creation is a lazy provisioning process (i.e. volume is created in DB only, not really on storage pool), let’s attach the disk to a running VM (using volume ID “47b1cfe5-6bab-4506-87b6-d85b77d9b69c” from previous command output), which will trigger the actual disk creation on our Ceph cluster (output shortened for brevity):

(localcloud) SBCM5> > attach volume id=47b1cfe5-6bab-4506-87b6-d85b77d9b69c virtualmachineid=19a67e20-c747-43bb-b149-c2b2294002f9
{
  "volume": {
    …
    "jobstatus": 0,
    "name": "MyFirstCephDisk",
    "path": "47b1cfe5-6bab-4506-87b6-d85b77d9b69c",
    …  }
}

Note the “path” output field (which is usually the same as the ID of the volume, except in some special cases) – and let’s check our Ceph cluster if we can find this volume and check it’s properties.

From any KVM node…

[root@kvm1 ~]# rbd ls -p cloudstack
47b1cfe5-6bab-4506-87b6-d85b77d9b69c
 
[root@kvm1 ~]# rbd info cloudstack/47b1cfe5-6bab-4506-87b6-d85b77d9b69c
rbd image '47b1cfe5-6bab-4506-87b6-d85b77d9b69c':
        size 5 GiB in 1280 objects
        order 22 (4 MiB objects)
        id: d43b4c04a8af
        block_name_prefix: rbd_data.d43b4c04a8af
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        op_features:
        flags:
        create_timestamp: Tue Mar 28 19:46:32 2019

We can also examine the Ceph RBD image with qemu-img tool:

[root@kvm1 ~]# qemu-img info rbd:cloudstack/47b1cfe5-6bab-4506-87b6-d85b77d9b69c
image: rbd:cloudstack/47b1cfe5-6bab-4506-87b6-d85b77d9b69c
file format: raw
virtual size: 5.0G (5368709120 bytes)
disk size: unavailable

As you can see in qemu-img command above, we did not specify any username and authentication keys, because we have our ceph.conf and the admin key files present in /etc/ceph/ folder. If you decided to opt-out of having these 2 files present on KMV nodes, you will have to use a cumbersome command as below:

qemu-img info rbd:cloudstack/47b1cfe5-6bab-4506-87b6-d85b77d9b69c:mon_host=10.2.2.219:auth_supported=Cephx:id=cloudstack:key=AQAFSZpc0t+BIBAAO95rOl+jgRwuOopojEtr/g==

In the above command we are specifying the MON IP address, username and key for authentication.

Now that you got the basics of consuming Ceph from CloudStack, feel free to also create Compute Offerings and System Offerings for Virtual Routers, Secondary Storage VM, Console Proxy VM and experiment with volume migration from i.e. NFS to Ceph. Be sure to have storage tags under control.

I hope that this article series has been interesting so far. In part 3 (which will be the final part), I will show you some examples of working with RBD images and will cover some Ceph specifics, both in general and related to CloudStack.

About the author

Andrija Panic is a Cloud Architect at ShapeBlue, the Cloud Specialists, and is a committer of Apache CloudStack. Andrija spends most of his time designing and implementing IaaS solutions based on Apache CloudStack.

As well as NFS and various block storage solutions for Primary Storage, CloudStack has supported Ceph with KVM for a number of years now. Thanks to some great Ceph users in the community lots of previously missing CloudStack storage features have been implemented for Ceph (and lots of bugs squashed), making it the perfect choice for CloudStack if you are looking for easy scaling of storage and decent performance.

In this and my next article, I am going to cover all steps needed to actually install a Ceph cluster from scratch, and subsequently add it to CloudStack. In this article I will cover installation and basic configuration of a standalone Ceph cluster, whilst in part 2 I will go into creating a pool for a CloudStack installation, adding Ceph to CloudStack as an additional Primary Storage and creating Compute and Disk offerings for Ceph. In part 3, I will also try to explain some of the differences between Ceph and NFS, both from architectural / integration point of view, as well as when it makes sense (or doesn’t) to use it as the Primary Storage solution.

It is worth mentioning that the Ceph cluster we build in this first article can be consumed by any RBD client (not just CloudStack). Although in part 2 we move onto integrating your new Ceph cluster into CloudStack, this article is about creating a standalone Ceph cluster – so you are free to experiment with Ceph.

Firstly, I would like to share some high-level recommendations from very experienced community members, who have been using Ceph with CloudStack for a number of years:

  • Make sure that your production cluster is at least 10 nodes so as to minimize any impact on performance during data rebalancing (in case of disk or whole node failure). Having to rebalance 10% of data has a much smaller impact (and duration) than having to rebalance 33% of data; another reason is improved performance as data is distributed across more drives and thus read / write performance is better
  • Use 10GB networking or faster – a separate network for client and replication traffic is needed for optimal performance
  • Don’t rely on cache tiering, unless you have a very specific IO pattern / use case. Moving data in and out of cache tier can quickly create a bottleneck and do more harm than good
  • If running an older version of Ceph cluster (eg. FileStore based OSD), you will probably place your journals on SSDs. If so, make sure that you properly benchmark SSD for the synchronous IO write performance (Ceph writes to journal devices with O_DIRECT and D_SYNC flags). Don’t try to put too many journals on single SSD; consumer grade SSDs are unacceptable, since their synchronous write performance is usually extremely bad and they have proven to be exceptionally unreliable when used in a Ceph cluster as journal device

Before we continue, let me state that this first article is NOT meant to be a comprehensive guide on Ceph history, theory, installation or optimization, but merely a simple step-by-step guide for a basic installation, just to get us going. Still, in order to be able to better follow the article, it’s good to define some basics around Ceph architecture.

Ceph has a couple of different components and daemons, which serves different purposes, so let’s mention some of these (relevant for our setup):

  • OSD (Object Storage Daemon) – usually maps to a single drive (HDD, SDD, NVME) and it’s the one containing user data. As can be concluded from it’s name, there is a Linux process for each OSD running in a node. A node hosting only OSDs can be considered as a Storage or OSD node in Ceph’s terminology.
  • MON (Monitor daemon) – holds the cluster map(s), which provides to Ceph Clients and Ceph OSD Daemons with the knowledge of the cluster topology. To clarify this further, in the heart of Ceph is the CRUSH algorithm, which makes sure that OSDs and clients can calculate the location of specific chunk of data in the cluster (and connect to specific OSDs for read/write of data), without a need to read it’s position from somewhere (as opposite to a regular file systems which have pointers to the actual data location on a partition).

A couple of other things are worth mentioning:

  • For cluster redundancy, it’s required to have multiple Ceph MONs installed, always aiming for an odd number to avoid a chance of split-brain scenario. For smaller clusters, these could be placed on VMs or even collocated with other Ceph roles (i.e. OSD nodes), though busier clusters will need a dedicated, powerful servers/VMs. In contrast to OSDs, there can be only one MON instance per server/VM.
  • For improved performance, you might want to place MON’s database (LevelDB) on dedicated SSDs (versus the defaults of being placed on OS partition).
  • There are two ways that OSDs can manage the data they store. Starting with the Luminous 12.2.z release, the new default (and recommended) backend is BlueStore. Prior to Luminous, the default (and only option) was FileStore. With FileStore, data is first written to a Journal (which can be collocated with the OSD on same device or it can be a completely separate partition on a faster, dedicated device) and then later committed to OSD. With BlueStore, there is no true Journal per se, but a RocksDB key/value database (for managing OSD’s internal metadata). FileStore OSD will use XFS on top of it’s partition, while BlueStore write data directly to raw device, without a need for a file system. With it’s new architecture, BlueStore brings big speed improvement over FileStore.
  • When building and operating a cluster, you will probably want to have a dedicated server/VM used as the deployment or admin node. This node will host your deployment tools (be it a basic ceph-deploy tool or a full blown ansible playbook), as well as cluster definition and configuration files, which can be changed on central place (this node) and then pushed to cluster nodes as required.

Armed with above knowledge (and against all recommendations given previously) we are going to deploy a very minimalistic installation of Ceph cluster on top of 3 servers (VMs), with 1 volume per node being dedicated for an OSD daemon, and Ceph MONs collocated with the Operating System on the system volume. The reason for choosing such a minimalistic setup is the ability to quickly build a test cluster on top of 3 VMs (which most people will do when building their very first Ceph cluster) and to keep configuration as short as possible. Remember, we just want to be able to consume Ceph from CloudStack, and currently don’t care about performance or uptime / redundancy (beside some basic things, which we will cover explicitly).

Our setup will be as following:

  • We will already have a working CloudStack 4.11.2 installation (i.e. we expect you to have a working CloudStack installation)
  • We will add Ceph storage as an additional Primary Storage to CloudStack and create offerings for it
  • CloudStack Management Server will be used as Ceph admin (deployment) node
  • Management Server and KVM nodes details:
    • CloudStack Management Server: IP 10.2.2.118
    • KVM host1: IP 10.2.3.135, hostname “kvm1”
    • KVM host2: IP 10.2.2.208, hostname “kvm2”
  • Ceph nodes details (dedicated nodes):
    • 2 CPU, 4GB RAM, OS volume 20GB, DATA volume 100GB
    • Single NIC per node, attached to the CloudStack Management Network – i.e. there is no dedicated network for Primary Storage traffic between our KVM hosts and the Ceph nodes
    • Node1: IP 10.2.2.119, hostname “ceph1”
    • Node2: IP 10.2.2.116, hostname “ceph2”
    • Node3: IP 10.2.3.159, hostname “ceph3”
    • Single OSD (100GB) running on each node
    • MON instance running on each node
    • Ceph Mimic (13.latest) release
    • All nodes will be running latest CentOS 7 release, with default QEMU and Libvirt versions on KVM nodes

As stated above Ceph admin (deployment) node will be on CloudStack Management Server, but as you can guess, you can use a dedicated VM/Server for this purpose as well.

Before proceeding with the actual work, let’s define the high-level steps required to deploy a working Ceph cluster

  • Building the Ceph cluster:
    • Setting time synchronization, host name resolution and password-less login
    • Setting up firewall and SELinux
    • Creating a cluster definition file and auth keys on the deployment node
    • Installation of binaries on cluster nodes
    • Provisioning of MON daemons
    • Copying over the ceph.conf and admin keys to be able to manage the cluster
    • Provisioning of Ceph manager daemons (Ceph Dashboard)
    • Provisioning of OSD daemons
    • Basic configuration

We will cover configuration of KVM nodes in second article.

Let’s start!

On all nodes…

It is critical that the time is properly synchronized across all nodes. If you are running on hypervisor, your VMs might already be synced with the host, otherwise do it the old-fashioned way:

ntpdate -s time.nist.gov
yum install ntp
systemctl enable ntpd
systemctl start ntpd

Make sure each node can resolve the name of each other node –  if not using DNS, make sure to populate /etc/hosts file properly across all 4 nodes (including admin node):

cat << EOM >> /etc/hosts
10.2.2.219 ceph1
10.2.2.116 ceph2
10.2.3.159 ceph3
EOM

On CEPH admin node…

We start by installing ceph-deploy, a tool which we will use to deploy our whole cluster later:

release=mimic
cat << EOM > /etc/yum.repos.d/ceph.repo
[ceph-noarch]
name=Ceph noarch packages
baseurl=https://download.ceph.com/rpm-$release/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
EOM
 
yum install ceph-deploy -y

Let’s enable password-less login for root account – generate SSH keys and seed public key into /root/.ssh/authorized_keys file on all Ceph nodes (in production environment, you might want to use a user with limited privileges with sudo escalation):

ssh-keygen -f $HOME/.ssh/id_rsa -t rsa -N ''
ssh-copy-id root@ceph1
ssh-copy-id root@ceph2
ssh-copy-id root@ceph3

On all CEPH nodes…

Before beginning, ensure that SELINUX is set to permissive mode and verify that firewall is not blocking required connections between Ceph components:

firewall-cmd --zone=public --add-service=ceph-mon --permanent
firewall-cmd --zone=public --add-service=ceph --permanent
firewall-cmd --reload
setenforce 0

Make sure that you make SELINUX changes permanent, by editing /etc/selinux.config and setting ‘SELINUX=permissive’

As for the firewall, in case you are using different distribution or don’t consume firewalld, please refer to the networking configuration reference at http://docs.ceph.com/docs/mimic/rados/configuration/network-config-ref/

On CEPH admin node…

Let’s create cluster definition locally on admin node:

mkdir CEPH-CLUSTER; cd CEPH-CLUSTER/
ceph-deploy new ceph1 ceph2 ceph3

This will trigger a ssh connection to each of above referenced Ceph nodes (to check for machine platform and IP addresses) and will then write a local cluster definition and the MON auth key in the current folder.  Let’s check the files generated:

# ls -la
-rw-r--r-- ceph.conf
-rw-r--r-- ceph-deploy-ceph.log
-rw------- ceph.mon.keyring

On Centos7, if you get the “ImportError: No module named pkg_resources” error message while running ceph-deploy tool, you might need to install missing packages:

yum install python-setuptools

In case that you have multiple network interfaces on Ceph nodes, you will be required to explicitly define public network (which accepts client’s connections) – in this case edit previously created ceph.conf on the local admin node to include public network setting:

echo "public network = 10.2.0.0/16" >> ceph.conf

If you only have one NIC in each Ceph node, the above line is not required.

Still on admin node, let’s start the installation of Ceph binaries across cluster nodes (no services started yet):

 ceph-deploy install ceph1 ceph2 ceph3 

Command above will also output the version of Ceph binaries installed on each node – make sure that you did not get a wrong Ceph version installed due to some other repos present (we are installing Mimic 13.2.5, which is latest as of the time of writing).

Let’s create (initial) MONs on all 3 Ceph nodes:

ceph-deploy mon create-initial

In order to be able to actually manage our Ceph cluster, let’s copy over the admin key and the ceph.conf files to all Ceph nodes:

ceph-deploy admin ceph1 ceph2 ceph3

On any CEPH node…

After previous step, you should be able to issue “ceph -s” from any Ceph node, and this will return the cluster health. If you are lucky enough, your cluster will be in HEALTH_OK state, but it might happen that your MON daemons will complain on time mismatch between the nodes, as following:

[root@ceph1 ~]# ceph -w
  cluster:
    id:     7f2d23c2-1f2e-4c03-821c-cab3d76f84fc
    health: HEALTH_WARN
            clock skew detected on mon.ceph1, mon.ceph3 

In this case, we should stop NTP daemon, force time update (a few times), and start NTP daemon again – and after doing this across all nodes, it would be required to restart Ceph monitors on each node, one by one (give it a few seconds between restart on different nodes) – below we are restarting all Ceph daemons – which effectively means just MONs since we deployed only MONs so far:

systemctl stop ntpd
ntpdate -s time.nist.gov; ntpdate -s time.nist.gov; ntpdate -s time.nist.gov
systemctl start ntpd
systemctl restart ceph.target

After time has been properly synchronized (with less then 0.05 seconds of time difference between the nodes), you should be able to see a cluster in HEALTH_OK state, as below:

[root@ceph1 ~]# ceph -s
  cluster:
    id:     7f2d23c2-1f2e-4c03-821c-cab3d76f84fc
    health: HEALTH_OK

On CEPH admin node…

Now that we are up and running with all Ceph monitors, let’s deploy Ceph manager daemon (Ceph dashboard, that comes with newer releases) on all nodes since they operate in active/standby configuration (we will configure it later):

ceph-deploy mgr create ceph1 ceph2 ceph3

Finally, let’s deploy some OSDs so our cluster can actually hold some data eventually:

ceph-deploy osd create --data /dev/sdb ceph1
ceph-deploy osd create --data /dev/sdb ceph2
ceph-deploy osd create --data /dev/sdb ceph3

Note in commands above, we reference /dev/sdb as the 100GB volume that is used for OSD.

As mentioned previously, newer versions of Ceph (as in our case) will use by default BlueStore as the storage backend, with (by default) collocating block data and RocksDB key/value database (for managing its internal metadata) on the same device (/dev/sdb in our case). In more complex setups, one can choose to separate RockDB DB on faster devices, while block data will remain on slower devices – somewhat similar with the older FileStore setups, where block data would be located on HDDs/SSDs devices, while Journals would be usually placed on SSD/NVME partitions.

On any CEPH node…

After previous step is done, we should get the output similar to below – confirming that we have a 300GB of space available:

[root@ceph1 ~]# ceph -s
  cluster:
    id:     7f2d23c2-1f2e-4c03-821c-cab3d76f84fc
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph1(active)
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0  objects, 0 B
    usage:   3.0 GiB used, 297 GiB / 300 GiB avail
    pgs:  

Finally, let’s enable the Dashboard manager and set the username/password for authentication (which will be encrypted and stored in monitor’s DB) to be able to access it.
In our lab, we will disable SSL connections and keep it simple – but obviously in production environment, you would want to force SSL connections and also install proper SSL certificate:

ceph config set mgr mgr/dashboard/ssl false
ceph mgr module enable dashboard
ceph dashboard set-login-credentials admin password

Let’s login to the Dashboard manager on the active node (ceph1 in our case, as can be seen in the output from “ceph -s” command above):

And there you go – you now have a working Ceph cluster, which concludes part 1 of this Ceph article series. In part 2 (published soon), we will continue our work by creating a dedicated RBD pool and authentication keys for our CloudStack installation, add Ceph to CloudStack, finally consuming it with dedicated Compute / Disk offerings.

It’s worth mentioning that Ceph itself does provided additional services – i.e. it supports S3 object storage (requires installation / configuration of Ceph Object Gateway) as well as POSIX-compliant file system CephFS (requires installation/configuration of Metadata Server), but for CloudStack, we only need Rados Block Device (RBD) services from Ceph.

About the author

Andrija Panic is a Cloud Architect at ShapeBlue, the Cloud Specialists, and is a committer of Apache CloudStack. Andrija spends most of his time designing and implementing IaaS solutions based on Apache CloudStack.

ShapeBlue SA are pleased to announce the extension of their distribution partner agreement for NetApp in South Africa, building out a successful relationship that started in 2014.

‘ShapeBlue has built a strong partnership with NetApp in this region. Expanding our capabilities to represent the full NetApp portfolio presents a strategic opportunity for us and our partners.’ Says Dan Crowe, Managing Director, ShapeBlue SA.

‘NetApp’s vision, depth of solutions and cloud-centric approach continues to differentiate them. We are seeing a fantastic response, in particular to the Cloud Infrastructure portfolio with HCI and the Cloud Data Services portfolio.’

ShapeBlue, as expert builders of clouds bring a unique insight to both service provider and integrator partners as they develop services, and work with customers on transformation projects.

ShapeBlue believe a new generation of NetApp partners can accelerate strategic initiatives across sectors and harness the true value of data insights.

ShapeBlue will offer SA based partners access to the full NetApp range of solutions, professional services and sales and marketing collaborations.

ShapeBlue have recently expanded office premises in both Cape Town and Johannesburg, with worldwide software engineering now based here in SA. “We’re excited about our newly expanded partnership with NetApp and looking forward to the next step in our evolution.” Concludes Crowe.

About ShapeBlue

ShapeBlue are the leading worldwide independent CloudStack integrator, with offices in London, Bangalore, Rio De Janerio, Mountain View CA, Cape Town and Johannesburg.
Services include consulting, integration, training and infrastructure support

Introduction

ShapeBlue have been working on a new feature for Apache Cloudstack 4.11.1 that will allow users to bypass secondary storage with KVM. The feature introduces a new way to use templates and ISOs, allowing administrators to use them without being cached on secondary storage. Using this approach Cloudstack administrators will not have to worry about massive secondary storage, since it will be simple bypassed, there won’t be any template sitting there waiting. As well it’s bypassing the SSVM since the download task will not be carried on by the SSVM, but the KVM agent itself. This will enable administrators not to spare resources for SSVM, but to use them for commercial purposes. The usual process of virtual machine deployment will stay as before.

Overview

This feature adds a new field in the vm_template table which is called ‘direct_download’. The field will determine if template needs to be downloaded by SSVM (in case of ‘0’), or directly on the host when deploying the VM (in case of ‘1’). CloudStack administrators will have the option to set this field through the UI or API call as described in the following examples:

From the UI:

From Cloudmonkey:

register template zoneid=3e80c1e6-0710-4018-9062-194d6b3bab97 ostypeid=6f232c75-5370-11e8-afb9-06354a01076f hypervisor=KVM url=http://dl.openvm.eu/cloudstack/macchinina/x86_64/macchinina-kvm.qcow2.bz2 format=QCOW2 displaytext=TestMachina name=TestMachina directdownload=true

The same feature applies to ISOs as well – they don’t need to be cached on secondary storage but can be directly downloaded by the host. CloudStack admins have this option available on the API call when registering ISOs and through the UI form as well.

Whenever a VM deployment is started the template will be downloaded on primary storage. The feature actually checks if the template/ISO has been already downloaded on the pool, checking template_spool_ref table. If there’s an entry on the table matching its pool ID and the template ID, then it won’t be downloaded again. The same action applies if the running VM requires the template again (eg. when reinstalling ). Please note that due to the direct download nature of this feature, the uniqueness of the templates across primary storage pools is the responsibility of the CloudStack operator. CloudStack itself can’t detect if the files in a template download URL have changed or not.

Metalinks are also supported for this feature, and administrators can be more flexible in terms of managing their templates as they can set priorities and location preferences in the metalink file. Metalinks are effectively xml that provides URLs for downloading files. The duplicate download locations provide reliability in case one method fails. Some clients also achieve faster download speeds by allowing different chunks/segments of each file to be downloaded from multiple resources at the same time. Please see the following example:

As the example shows, CloudStack administrators can set location preference and priority, which will be considered upon VM deployment. The deployment logic itself introduces a retry mechanism in 2 cases of failures: VM deployment failure and template download failure.

VM deployment retry logic: this will initiate the deployment on a suitable host and will try to deploy it (which includes the template download itself). If the deployment fails for some reason it will retry the deployment on another suitable host.

Template download retry logic: this is part of the VM deployment and will try to download the template/iso directly by the host. If it fails for some reason (e.g. URL not available) it will iterate through the provided list of priority and location. Once download is completed it will execute the checksum validation (if provided), if that one fails it will download it again, until it has made three attempts. If all three attempts unsuccessful it will return a deployment failure and go back to VM Deployment logic.

Please see the following simplified picture of the deployment logic:

Since the download task has been delegated to the KVM agent instead of SSVM, this feature will be available only for KVM templates.

About the author

Boris Stoyanov is Software Engineer in testing at ShapeBlue, The Cloud Specialists. Bobby spends his time testing features for the Apache CloudStack Community and for our ShapeBlue clients.

Last year we had a project which required us to build out a KVM environment which used shared storage. Most often that would be NFS all the way and very occasionally Ceph.   This time however the client already had a Fibre Channel over Ethernet (FCoE) SAN which had to be used, and the hosts were HP blades using shared converged adaptors in the chassis- just add a bit more fun.

A small crowbar and a large hammer later, the LUNs from the SAN were being presented to the hosts. So far so good.  But…

Clustered File Systems

If you need to have a volume shared between two or more  hosts, you can provision the disk to all the machines, and everything might appear to work, but each host will be maintaining its own inode table and so will be unaware of changes other hosts are making to the file system, and in the event that writes ever happened to the same areas of the disk at the same time you will end up with data corruption. The key is that you need a way to track locks from multiple nodes.  This is called a Distributed Locking Manager or DLM and for this you need a Clustered File System.

Options

There are dozens of clustered file systems out there, proprietary and open source.
For this project we needed a file system which;

  • Supported on CentOS6.7
  • Open source
  • Supports multi-path
  • Easy to configure not a complex group of Distributed Parallel Filesystems
  • Need to support concurrent file access and deliver the utmost performance
  • No management node over head, so more cluster drive space.

So we opted for OCFS2 (Oracle Clustered File System 2)

Once you have the ‘knack’, installation isn’t that arduous, and it goes like this…

These steps should be repeated on each node.

1. Installing the OCFS file system binaries

In order to use OCFS2, we need to install the kernel modules and OCFS2-tools.

First we need to download and install the OCFS2 kernel modules for CentOS 6.  Oracle now bundles the OCFS2 kernel modules in its Unbreakable Kernel, but they also used to be shipped with CloudStack 3.x so we used those.

wget http://shapeblue.s3.amazonaws.com/ocfs2-kmod-1.5.0-1.el6.x86_64.rpm"
rpm -i ocfs2-kmod-1.5.0-1.el6.x86_64.rpm 

Next we copy the OCFS2 kernel modules into the current running kernel directory for CentOS 6.7

cp -Rpv /lib/modules/2.6.32-71.el6.x86_64/extra/ocfs2/ /lib/modules/2.6.32-573.3.1.el6.x86_64/extra/ocfs2

Next we update the running kernel with the newly installed modules.

depmod –a

Add the Oracle yum repo for el6 (CentOS 6.7) for the OCFS2-tools

cd /etc/yum.repos.d
wget http://public-yum.oracle.com/public-yum-ol6.repo" 

And add the PKI keys for the Oracle el6 YUM repo

cd /etc/pki/rpm-gpg/
wget http://public-yum.oracle.com/RPM-GPG-KEY-oracle-ol6
rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-oracle-ol6 

Now we can install the OCFS2 tools to be used to administrate the OCFS2 Cluster.

yum install -y ocfs2-tools

Finally we add the OCFS2 modules into the init script to load OCFS2 at boot.

sed -i "/online \"\$1\"/a\/sbin\/modprobe \-f ocfs2\nmount\-a" /etc/init.d/o2cb

2. Configure the OCFS2 Cluster.

OCFS2 cluster nodes are configured through a file (/etc/ocfs2/cluster.conf). This file has all the settings for the OCFS2 cluster. An example configuration file might look like this:

cd /etc/ocfs2/
vim cluster.conf

node:
ip_port = 7777
ip_address = 192.168.100.1
number = 0
name = host1.domain.com
cluster = ocfs2

node:
ip_port = 7777
ip_address = 192.168.100.2
number = 1
name = host2.domain.com
cluster = ocfs2

node:
ip_port = 7777
ip_address = 192.168.100.3
number = 2
name = host3.domain.com
cluster = ocfs2

cluster:
node_count = 3
name = ocfs2

We will need to run the o2cb service from the /etc/init.d/ directory to configure the OCFS2 cluster.

/etc/init.d/o2cb configure
Load O2CB driver on boot (y/n) [y]: y

Cluster stack backing O2CB [o2cb]: ENTER
Cluster to start on boot (Enter "none" to clear) [ocfs2]: ENTER
Specify heartbeat dead threshold (=7) [31]: ENTER
Specify network idle timeout in ms (=5000) [30000]: ENTER
Specify network keepalive delay in ms (=1000) [2000]: ENTER
Specify network reconnect delay in ms (=2000) [2000]: ENTER

Update the iptables rules to allow the OCFS2 Cluster port 7777 on all the nodes that we have installed:

iptables -I INPUT -p udp -m udp --dport 7777 -j ACCEPT
iptables -I INPUT -p udp -m udp --dport 7777 -j ACCEPT
iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 7777 -j ACCEPT
iptables-save >> /etc/sysconfig/iptables
Restart the iptables service
service iptables restart 

3. Setting up Linux file system

First we create a directory where the OCFS2 system will be mounted.

mkdir –p /san/primary/ 

We need to format the mounted volume as OCFS2. This only needs to be run on ONE of the nodes in the cluster.

mkfs.ocfs2 -L OCFS2_label -T vmstore --fs-feature-level=max-compat /dev/sdd -N (number of nodes +1) 

The options work like this:
-L Is the Label of the OCFS2 cluster
-T What will the cluster be used for, type of Data
-fs-feature-level making OCFS2 compatible with older versions

4. Update the Linux FSTAB with the OCFS2 drive settings.

Next we had the following line to /etc/fstab to mount the volume at every boot.

 /dev/sdd /san/primary _netdev,nointr 0 0

5. Mount the OCFS2 cluster.

Once the fstab has been updated we’ll need to mount the volume

mount -a

This will give us a mount point on each node in this cluster of /san/primary. This mount point is backed by the same LUN in the SAN, but most importantly the filesystem is aware that there are multiple hosts connected to it and will lock files accordingly.

Each cluster of hosts would have a specific LUN (or LUNs) which is would connect to.  It makes life a lot simpler if you are able to mask the LUNs from SAN such that only the hosts which will connect to a specific LUN can see that LUN, as this helps to avoid any mix ups.

Adding this storage into CloudStack

In order for the KVM hosts to utilise this storage in a CloudStack context, we must add the shared LUNs as primary storage in CloudStack. This is done by setting the storage type to ‘presetup – SharedMountPoint’ when adding the primary storage pools for these clusters.  The mountpoint path should be specified in the way that they will be seen locally by the KVM hosts; in this case – /san/primary.

Summary

In this article we looked at the requirement for a Clustered File System when connecting KVM hosts to a SAN and how to configure OCFS2 on CentOS6.7

 

About The Authors

Glenn Wagner is  a Senior Consultant / Cloud Architect at ShapeBlue, The Cloud Specialists. Glenn spends most of his time designing and implementing IaaS solutions based on on Apache CloudStack.

Paul Angus is VP Technology & Cloud Architect at ShapeBlue. He has designed and implemented numerous CloudStack environments for customers across 4 continents, based on Apache CloudStack.
Some say; that when not building Clouds, Paul likes to create Ansible playbooks that build clouds. And that he’s actually read A Brief History of Time.

Paul Angus, Cloud Architect at ShapeBlue takes an interesting look at how to separate Cloudstack’s management traffic from its primary storage traffic.

I recently  looked at physical networking in a CloudStack environment and alluded to the fact that you cannot separate primary storage traffic from management traffic from CloudStack, but that it is still possible. In this article I will discuss why this is and how to do it.

 In the beginning, there was primary storage

The first thing to understand is the process of provisioning primary storage. When you create a primary storage pool for any given cluster, the CloudStack management server tells each hosts’ hypervisor to mount the NFS share or (iSCSI LUN). The storage pool will be presented within the hypervisor as a datastore (VMware), storage repository (XenServer/XCP) or a mount point (KVM), the important point is that it is the hypervisor itself that communicates with the primary storage, the CloudStack management server only communicates with the host hypervisor.

Now, all hypervisors communicate with the outside world via some kind of management interface – think VMKernel port on ESXi or ‘Management Interface’ on XenServer. As the CloudStack management server needs to communicate with the hypervisor in the host, this management interface must be on the CloudStack ‘management’ or ‘private’ network. There may be other interfaces configured on your host carrying guest and public traffic to/from VMs within the hosts but the hypervisor itself doesn’t/can’t communicate over these interfaces.

hypervisorcomms
Figure 1: Hypervisor communications

Separating Primary Storage traffic

For those from a pure virtualisation background, the concept of creating a specific interface for storage traffic will not be new; it has long been best practice for iSCSI traffic to have a dedicated switch fabric to avoid any latency or contention issues.

Sometimes in the cloud(Stack) world we forget that we are simply orchestrating processes that the hypervisors already carry out and that many ‘normal’ hypervisor configurations still apply.

The logical reasoning which explains how this splitting of traffic works is as follows:

1. If you want an additional interface over which the hypervisor can communicate (excluding teamed or bonded interfaces) you need to give it an IP address

2. The mechanism to create an additional interface that the hypervisor can use is to create an additional management interface

3. So that the hypervisor can differentiate between the management interfaces they have to be in different (non-overlapping) subnets

4. In order for the ‘primary storage’ management interface to communicate with the primary storage, the interfaces on the primary storage must be in the same CIDR as the ‘primary storage’ management interface.

5. Therefore the primary storage must be in a different subnet to the management network

subnetting storage

Figure 2: Subnetting of Storage Traffic

hypervisorcomms-secstorage

Figure 3: Hypervisor Communications with Separated Storage Traffic

Other Primary Storage Types

If you are using PreSetup or SharedMountPoints to connect to IP based storage then the same principles apply; if the primary storage and ‘primary storage interface’ are in a different subnet to the ‘management subnet’ then the hypervisor will use the ‘primary storage interface’ to communicate with the primary storage.

Summary

This article has explained the how primary storage traffic can be routed over separate network interfaces from all other traffic on the hosts by adding a management interface on the host for storage and allocating it and the primary storage IP addresses in a different subnet to the CloudStack management subnet.

About the Author

Paul Angus is a Cloud Architect at ShapeBlue, The Cloud Specialists. He has designed numerous CloudStack environments for customers across 4 continents, based on Apache Cloudstack ,Citrix Cloudplatform and Citrix Cloudportal.

 

When not building Clouds, Paul likes to create scripts that build clouds

..and he very occasionally can be seen trying to hit a golf ball.