Deprecating libvirt / KVM hypervisor versions in OpenStack Nova

Posted: May 18th, 2015 | Filed under: Fedora, libvirt, OpenStack, Virt Tools

If you read nothing else, just take note that in the Liberty release cycle Nova has deprecated usage of libvirt versions < 0.10.2, and in the Mxxxxx release cycle support for running with libvirt < 0.10.2 will be explicitly dropped.

OpenStack has a fairly aggressive policy of updating the minimum required versions of python modules it depends on. Many python modules are updated pretty frequently and (bitter) experience has shown that updates will often not be API compatible, even across seemingly minor version number changes. Maintaining working OpenStack code across different incompatible versions of the same module is tricky to get right and will inevitably be fragile without good testing coverage. While OpenStack has a huge level of testing, it cannot reasonably be expected to track the matrix of different incompatible python module versions. So (reluctantly) I accept that OpenStack has chosen the only practical approach, which is to increase the minimum required version of a library any time it is found to be incompatible with an older version, or whenever a required new feature is only present in a newer version. Now this does create pain, in that the versions of python modules shipped in most distros are going to be too old to satisfy OpenStack’s needs. Thus when deploying OpenStack the distro provided versions must be updated to something newer. Fortunately, most OpenStack deployment tools mitigate the pain for users by taking ownership of installation and management of the full python stack, whether a 3rd party module or an OpenStack provided module, and this works pretty well in general.

It is important to contrast this with the situation found for dependencies on non-python modules, and in particular for Nova, the hypervisor platform that is targeted. While OpenStack does get some testing coverage of the hypervisor control plane, it is inconsequential when placed in the context of testing done by the hypervisor vendors themselves. The vendors will of course have tested the control plane themselves, both directly and often in the context of higher level apps such as oVirt and OpenStack. Beyond that though, the vendors will test a whole suite of guest operating systems to ensure they deploy and operate in a functionally correct manner. For Windows guests, there will be certifications of accelerated guest drivers via WHQL and of the OS as a whole with Microsoft’s SVVP. The vendor will benchmark and validate the scalability and performance of the hypervisor on a multitude of compute workloads, and against various different storage and network technologies. For government related deployments, the platform will go through Common Criteria certifications and security audits. Finally, the vendor will have a team of people maintaining the version they ship, most critically to deal with security errata. I should note that I’m thinking primarily about Open Source hypervisors here, and the difference between upstream releases and productized downstream releases. For closed source hypervisors you only ever get access to the productized release.

This is all a long winded way of saying that it is a very hard sell for OpenStack to require users to update their hypervisor versions to something OpenStack has tested, in preference to the version that the vendor ordinarily ships & supports. The benefit of OpenStack’s testing of the hypervisor control plane does not come anywhere close to offsetting the costs of losing the testing, certification & support work that the vendor has put into the hypervisor platform as a whole. There are also costs suffered directly by the user wrt platform upgrades, as distinct from application upgrades. It is fairly common for organizations to go through their own internal build and certification process when deploying a new operating system and/or hypervisor platform. This will include jobs such as integrating with their network services, particularly authentication & authorization engines, service monitoring frameworks, auditing systems and backup services. In addition the OS/hypervisor is also likely to undergo testing against any hardware platforms/models that the organization may have standardized on. It may take as long as 6 months, or even 12, before some organizations are ready to deploy a new hypervisor platform released by a vendor. Once an organization has deployed a platform, they will naturally wish to maximise its useful lifetime before upgrading to newer versions. This is in stark contrast to the applications that an organization runs on the platform, which may be upgraded very frequently, in a matter of weeks or even days. It is sad that there can be such time lags for platforms but not applications, but unfortunately this is just the way many organizations’ IT support works.

For these reasons, OpenStack needs to take a different approach to hypervisor platforms, and be pretty conservative about updating the minimum required version. The costs on users will be quite large and not something that can be mitigated by deployment tools that OpenStack can provide, unless the organization is one of the minority that is nimble enough to cope with a continuous deployment model and has enough in house expertise to take on a degree of hypervisor maintenance. In cases where Nova does wish to update the minimum required version there needs to be a fairly compelling set of benefits to Nova that outweigh the costs that will be imposed on the downstream users. Mere prettiness / cleanliness of the code is exceedingly unlikely to count as a compelling benefit.

Looking specifically at the Libvirt + KVM platform dependency in Nova, back in November 2013 we increased the minimum required libvirt from 0.9.6 to 0.9.11. This had the cost of dropping the ability to run Nova on the (then current) Ubuntu LTS platform. This cost was largely mitigated by the fact that Canonical provide the Cloud Archive add-on repository which ships newer libvirt and KVM versions specifically for use with OpenStack, so users had an easy way out in that case. The compelling benefit to Nova though, was that it enabled OpenStack to depend on the new libvirt-python module that had been split off from the main libvirt package and made available on PyPI. This made it possible for OpenStack testing to set up virtualenvs with specific libvirt python versions, in common with its approach for any other python modules. More importantly, this new libvirt-python has support for the Python 3 platform, thus unblocking that porting item for Nova. As a result, the upgrade from 0.9.6 to 0.9.11 was a clear net win on balance.

The benefit of increasing the minimum required libvirt to values beyond 0.9.11 is harder to articulate. It would enable removal of a few workarounds in OpenStack but nothing that is imposing an undue burden on Nova Libvirt driver maintenance at this time. Mostly the problem with older versions is that they simply lack a lot of functionality compared to current versions, so there will be an increasingly large set of OpenStack features which will not work at all on such old versions. They also get comparatively less testing by developers, vendors and users alike, so as time goes by we’re less likely to spot incompatibilities with old versions which will ultimately affect the experience users have when deploying OpenStack. It is less clear cut when to draw the line though, in these cases. To help guide our decision making, a list of currently shipping libvirt, kvm and libguestfs versions across distros is maintained. For the community focused distros with short lifetimes (short == less than 2 years from release to end-of-life), it is quite simple to just drop them as supported targets when they go end of life. So from the POV of Fedora, at time of writing, we’ll only care about Nova supporting Libvirt >= 1.1.3. For the enterprise focused distros with long lifetimes (long == more than 2 years, often 5-10 years), it is hard to decide when to drop them as a supported target. As mentioned earlier, enterprise organizations will typically have quite a time lag between a new release coming out and it being something that is widely deployed. Despite RHEL-7 having been available since June 2014, it is not uncommon for organizations to still be using RHEL-6 for new platform deployments. Officially, RHEL-6 is a supported platform by Red Hat until at least 2020, but clearly Nova will not wish to continue targeting it for that length of time. So there is a question of when it is reasonable for Nova to end support for the RHEL-6 platform.

Nova has already dropped support for Python 2.6, so RHEL-6 users will need to use the Software Collections Layer to get Python 2.7 access, and Red Hat’s OpenStack product is now RHEL-7 based only, so clearly Nova on RHEL-6 is entering its twilight years.

Looking at the current distro support matrix for libvirt versions, it was decided that support for Debian Wheezy and openSUSE 12.2 was reasonable to drop, but at this time Nova will continue to support RHEL-6 vintage libvirt. To provide users with greater advance notice, it was agreed that dropping of libvirt/kvm versions should require issuance of a deprecation warning for one release cycle. So in the Liberty release, Nova will now print out a warning if run on libvirt < 0.10.2, and in the Mxxxx release cycle this will turn into a fatal error. So anyone currently deployed on libvirt 0.9.11 through 0.10.1 has advance warning to plan for an upgrade of their hypervisor platform. I suspect that RHEL-6 may well get the chop one cycle later, e.g. we’d issue a warning in Mxxx and drop it in the Nxxxx release, as RHEL-7 would have been available for 2 years by that point and should be taking the overwhelming majority of KVM hypervisor deployments.
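
To make the warn-then-fail behaviour concrete, here is a minimal sketch in Python of how such a check could look. It is not the actual Nova code: the function, exception and constant names are made up for illustration. The only facts it relies on are the two version numbers from this post and libvirt's convention of reporting its version as a single integer (major * 1,000,000 + minor * 1,000 + release).

```python
import logging

LOG = logging.getLogger(__name__)

# Versions taken from the post: 0.9.11 is the current hard minimum,
# 0.10.2 becomes the hard minimum one release cycle later.
MIN_LIBVIRT_VERSION = (0, 9, 11)
NEXT_MIN_LIBVIRT_VERSION = (0, 10, 2)


class UnsupportedHypervisorVersion(Exception):
    """Hypothetical exception for a libvirt that is too old to run on."""


def version_to_int(version):
    """Encode (major, minor, release) the way libvirt reports versions."""
    major, minor, release = version
    return major * 1000000 + minor * 1000 + release


def version_to_str(version):
    return ".".join(str(part) for part in version)


def check_libvirt_version(host_version_int):
    """Return "ok" or "deprecated"; raise if below the hard minimum.

    host_version_int is the integer-encoded libvirt version. On a real
    host it would be obtained from the libvirt connection; here it is
    simply passed in so the sketch runs without libvirt installed.
    """
    if host_version_int < version_to_int(MIN_LIBVIRT_VERSION):
        raise UnsupportedHypervisorVersion(
            "libvirt %s or newer is required"
            % version_to_str(MIN_LIBVIRT_VERSION))
    if host_version_int < version_to_int(NEXT_MIN_LIBVIRT_VERSION):
        # Deprecation period: keep running, but tell the operator.
        LOG.warning(
            "Running with libvirt older than %s is deprecated and will "
            "stop working in the next release cycle",
            version_to_str(NEXT_MIN_LIBVIRT_VERSION))
        return "deprecated"
    return "ok"
```

Once the deprecation cycle is over, bumping MIN_LIBVIRT_VERSION up to the old NEXT_MIN_LIBVIRT_VERSION value turns the warning into the fatal error, without touching the rest of the logic.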

One of the things to come out of the discussion around incrementing the libvirt minimum version was that we haven’t really articulated what our policy is in this area. As one of the lead maintainers of the Nova libvirt driver, this blog post is an attempt to set out my views on the matter. As you can see there is no simple answer, but the intent is to be as conservative as practical to minimize the number of users who are likely to be impacted by decisions to increase the minimum version. It also became clear that we need to do a better job of articulating our approach to required platform versions to users in documentation. Previously there had been an attempt to categorize Nova hypervisor platforms/drivers into three groups, primarily according to the level of testing they have in the OpenStack or 3rd party CI systems. The intention behind this is fine, but the usefulness to users is somewhat limited because OpenStack CI obviously only tests a handful of very specific hypervisor platforms. So this classification gives you confidence that a Nova driver has been tested, but not confidence that it has been tested with your particular versions. So functionality that OpenStack claims is tested & operational may not be available on your platform due to version differences. To address this, OpenStack needs to provide more detailed information to users; in particular it must distinguish between the versions of a hypervisor that Nova is technically capable of running against, and the versions of a hypervisor that have been validated by CI. Armed with this knowledge, where those versions differ, it is reasonable for the user to look to their hypervisor vendor for confirmation that the vendor’s own testing can provide a level of assurance equivalent to the OpenStack CI testing. The user also has the option of running the OpenStack CI tests themselves against their own specific deployment platform.

On the theme of providing users with more information about hypervisor capabilities, the Nova feature support matrix which was previously held in a wiki has been turned into a piece of formal documentation maintained in Nova itself. The intent is to continue to expand this to provide more fine grained information about features, and eventually annotate them with any caveats about minimum required versions of the hypervisor in the associated notes for each feature item.
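
As a rough illustration of the distinction between "technically capable of running against" and "validated by CI", the following Python sketch classifies a deployed libvirt version into one of three buckets. This is purely illustrative: the function name and bucket labels are invented, and the CI-validated version in particular is a placeholder, not a statement of what OpenStack CI actually runs. Only the 0.9.11 minimum comes from this post.

```python
# Minimum libvirt version Nova will run against (from the post).
MIN_SUPPORTED_VERSION = (0, 9, 11)

# HYPOTHETICAL placeholder: the set of versions exercised by CI.
# The real values would have to come from the CI system itself.
CI_VALIDATED_VERSIONS = frozenset([(1, 2, 2)])


def classify_version(version,
                     min_supported=MIN_SUPPORTED_VERSION,
                     ci_validated=CI_VALIDATED_VERSIONS):
    """Classify a (major, minor, release) hypervisor version tuple.

    Returns one of:
      "unsupported"        - older than the minimum, Nova will not run
      "ci-validated"       - exactly a version that CI has exercised
      "supported-untested" - Nova runs, but assurance has to come from
                             the vendor or from the user's own testing
    """
    if version < min_supported:
        return "unsupported"
    if version in ci_validated:
        return "ci-validated"
    return "supported-untested"
```

With the placeholder data above, classify_version((0, 10, 2)) falls into the "supported-untested" bucket, which is exactly the case where a user would look to their hypervisor vendor, or to their own run of the CI tests, for assurance.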