Libvirt: use of GCC/Clang extension for automatic cleanup functions

Posted: January 31st, 2020 | Filed under: Coding Tips, Fedora, libvirt, Virt Tools | Tags: | No Comments »

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change of the overall system architecture or some core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary to libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon in October 2019.

Automatic memory cleanup

Libvirt has always aimed to be portable across a wide set of operating system platforms, and this included portability to different compiler toolchains. In the early days of the project GCC was the most common target, but users did occasionally use the Solaris and Microsoft native compilers. Fast forward to today and the legacy UNIX platforms are much less relevant. Officially libvirt only targets Linux, FreeBSD, macOS and Windows as supported platforms, and all of these have GCC or Clang, or both, available. These compilers are available on any platform that we’re likely to add in the future too. Conceivably people might still want to use Microsoft compilers, but their feature set is so poor compared to GCC/Clang that we long ago discounted them as a toolchain to support.

Thus in the early part of last year, libvirt made the explicit decision to only support GCC and Clang henceforth. This in turn freed the project to take full advantage of extensions to the C language offered by these compilers.

The extension which motivated this decision was the cleanup attribute. This allows a variable declaration to have a function associated with it that will be automatically invoked, with a pointer to the variable as its argument, when the variable goes out of scope. The most obvious use for these cleanup functions is to release heap memory associated with pointers, and this is exactly what libvirt wanted to do. This is not the only use case though; they are also convenient for other tasks such as closing file descriptors, decrementing reference counts, unlocking mutexes, and so on.

The native C syntax for using this feature is fairly ugly

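/* free_charp(char **p) is an assumed helper calling free(*p); the handler gets a pointer to foo */
__attribute__((__cleanup__(free_charp))) char *foo = NULL;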

but this can be made more attractive via macros. For example, GLib provides several macros to simplify life: g_autofree, g_autoptr and g_auto.

Thus the old libvirt coding pattern of

void dosomething(char *param) {
  char * foo;

  ...some code...

  foo = g_strdup_printf("Some string %s", param);
  if (something() < 0)
     goto cleanup;

  ... some more code... 

cleanup:
  free(foo);
}

can be replaced by something like

void dosomething(char *param) {
  g_autofree char *foo = NULL;

  ...some code...

  foo = g_strdup_printf("Some string %s", param);
  if (something() < 0)
     return;

  ... some more code... 
}
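To make the mechanism behind such macros more concrete, here is a minimal sketch that uses the compiler attribute directly, without GLib. The names auto_free_charp, AUTO_FREE_STR, format_greeting and cleanup_calls are invented for this illustration and are not part of libvirt or GLib; the counter exists only so the effect can be observed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Counts how often the cleanup handler ran (illustration only). */
static int cleanup_calls;

/* The handler receives a pointer to the annotated variable (char **),
 * not the variable's value, so free() cannot be named directly. */
static void auto_free_charp(char **ptr)
{
    assert(ptr != NULL);   /* the compiler always passes the variable's address */
    free(*ptr);            /* free(NULL) is a no-op */
    *ptr = NULL;
    cleanup_calls++;
}

/* Hypothetical macro playing the role that g_autofree plays in GLib. */
#define AUTO_FREE_STR __attribute__((__cleanup__(auto_free_charp)))

/* Returns the length of the formatted greeting, or -1 on early exit.
 * Every return path releases 'msg' without a goto or explicit free(). */
static int format_greeting(const char *name)
{
    AUTO_FREE_STR char *msg = NULL;   /* always initialize */
    size_t len;

    if (name == NULL)
        return -1;                    /* handler runs on a NULL pointer */

    len = strlen("Hello, ") + strlen(name) + 1;
    msg = malloc(len);
    if (msg == NULL)
        abort();
    snprintf(msg, len, "Hello, %s", name);

    return (int)strlen(msg);          /* value computed, then handler frees msg */
}
```

The return expression is evaluated before the handler runs, so it is safe to read the variable in the return statement itself.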

There are still some “gotchas” to be aware of. Care must be taken to ensure any variable declared with automatic cleanup is always initialized, otherwise the cleanup function will operate on uninitialized stack data. If a pointer stored in an automatic cleanup variable needs to be returned to the caller of the function, the local variable must be set to NULL first so that the cleanup function does not free it. Fortunately GLib provides a convenient helper, g_steal_pointer, for exactly this purpose.
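Both gotchas can be sketched in plain C. The helper steal_str below is a hand-written stand-in for what g_steal_pointer does, and all other names (auto_free_charp, AUTO_FREE_STR, dup_nonempty, frees_of_live_pointer) are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Counts frees of non-NULL pointers by the handler (illustration only). */
static int frees_of_live_pointer;

static void auto_free_charp(char **ptr)
{
    if (*ptr != NULL)
        frees_of_live_pointer++;
    free(*ptr);
}

#define AUTO_FREE_STR __attribute__((__cleanup__(auto_free_charp)))

/* Stand-in for g_steal_pointer(): hand back the pointer and leave NULL
 * behind, so the cleanup handler has nothing left to free. */
static char *steal_str(char **ptr)
{
    char *stolen = *ptr;

    *ptr = NULL;
    return stolen;
}

/* Returns a heap copy of 'src' that the caller must free(), or NULL if
 * 'src' is the empty string. */
static char *dup_nonempty(const char *src)
{
    AUTO_FREE_STR char *copy = NULL;  /* initialized: safe on any return */
    size_t len;

    assert(src != NULL);
    len = strlen(src);

    copy = malloc(len + 1);
    if (copy == NULL)
        abort();
    memcpy(copy, src, len + 1);

    if (len == 0)
        return NULL;                  /* handler frees 'copy' for us */

    return steal_str(&copy);          /* ownership moves to the caller */
}
```

Without the steal step, the function would return a pointer that the handler has already freed by the time the caller sees it.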

The previous blog post described how many goto jumps were eliminated by aborting on OOM, instead of trying to gracefully clean up and report it. The remaining goto jumps were primarily for freeing memory, closing file descriptors, and releasing mutexes, most of which can be eliminated with these cleanup functions.
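The same attribute works for resources other than heap memory. As a rough sketch of the file descriptor case, assuming a POSIX system, the following uses a handler that closes a descriptor when its variable goes out of scope. The names auto_close_fd, AUTO_CLOSE, pipe_roundtrip and closed_fds are made up for this example and do not come from libvirt.

```c
#include <assert.h>
#include <unistd.h>

/* Counts descriptors closed by the handler (illustration only). */
static int closed_fds;

/* Cleanup handler: receives a pointer to the int holding the descriptor. */
static void auto_close_fd(int *fd)
{
    if (*fd >= 0) {
        close(*fd);
        *fd = -1;
        closed_fds++;
    }
}

#define AUTO_CLOSE __attribute__((__cleanup__(auto_close_fd)))

/* Sends 'len' bytes through a pipe and reads them back into 'out'.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t pipe_roundtrip(const char *msg, size_t len, char *out)
{
    AUTO_CLOSE int rfd = -1;   /* -1 means "nothing to close yet" */
    AUTO_CLOSE int wfd = -1;
    int fds[2];

    assert(msg != NULL && out != NULL);

    if (len == 0)
        return -1;             /* both still -1, handlers do nothing */
    if (pipe(fds) < 0)
        return -1;
    rfd = fds[0];
    wfd = fds[1];

    if (write(wfd, msg, len) != (ssize_t)len)
        return -1;             /* both descriptors closed automatically */

    return read(rfd, out, len);
}
```

Initializing the descriptors to -1 serves the same purpose as initializing pointers to NULL: the handler can tell that there is nothing to release on an early return.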

The result is that the libvirt code can be dramatically simplified, which reduces the maintenance burden on libvirt contributors, allowing more time to be spent on coding features which matter to users. As an added benefit, in converting code over to use automatic cleanup functions we’ve fixed a number of memory leaks not previously detected, which reinforces the value of using this C extension.

Incidentally, after this was introduced in libvirt last year, I suggested that QEMU also adopt automatic cleanup functions, since it has also mandated GCC or Clang as the only supported compilers, and this was accepted.

Libvirt: adoption of GLib library to replace GNULIB & home grown code

Posted: January 30th, 2020 | Filed under: Coding Tips, Fedora, libvirt, Virt Tools | Tags: | No Comments »

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change of the overall system architecture or some core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary to libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon.

Portability and API abstractions

Libvirt traditionally targeted the POSIX standard API, but there are a number of difficulties with this. Much of POSIX is optional, so it cannot be assumed to exist on every platform. Plenty of platforms are non-compliant with the spec, or have different behaviour in scenarios where the spec allows for multiple interpretations. To deal with this libvirt used the GNULIB project, which is a copylib that attempts to fix POSIX non-compliance issues. It is very effective at this, but portability is only one of the problems with using POSIX APIs directly. It is a very low level API, so simple tasks like listening on a TCP socket require many complex API calls. Other APIs have poor designs by modern standards, which makes it easy for developers to introduce bugs. The malloc APIs are a particular case in point.

As a result libvirt has created many higher level abstractions around the POSIX APIs. Looking at modern programming languages though, such higher level abstractions are already a standard offering. This allows developers to focus on solving their application’s domain specific problems. Libvirt maintainers by contrast have spent a lot of time developing abstractions unrelated to virtualization such as object / class systems, DBus client APIs, hash tables / bitmaps, sockets / RPC systems, and much more. This is not a good use of limited resources in the long term.
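To give a feel for how low level the POSIX API is, here is a rough sketch of what “listening on a TCP socket” involves when written directly against it. It is deliberately minimal (IPv4 only, wildcard address, no name resolution), and the function names listen_tcp and bound_port are invented for this example rather than taken from libvirt.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Opens an IPv4 TCP socket listening on 'port' on all interfaces
 * (0 lets the kernel pick a free port).  Returns the descriptor,
 * or -1 on error. */
static int listen_tcp(unsigned short port)
{
    struct sockaddr_in addr;
    int reuse = 1;
    int fd;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse)) < 0)
        goto error;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto error;
    if (listen(fd, SOMAXCONN) < 0)
        goto error;

    return fd;

 error:
    close(fd);
    return -1;
}

/* Reports which port a listening socket ended up on, or 0 on error. */
static unsigned short bound_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    assert(fd >= 0);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return 0;
    return ntohs(addr.sin_port);
}
```

Four system calls, manual byte order conversion, and a hand-written error path are needed before a single connection can be accepted, and a production version would also need to handle IPv6 and address lookup.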

Adoption of GLib

These problems are common to many applications / libraries that are written in C and thus there are a number of libraries that attempt to provide a high level “standard library”. The GLib library is one such effort from the GNOME project developers that has long been appealing. Some of libvirt’s internal APIs are inspired by those present in GLib, and it has been used by QEMU for a long time too. What prevented libvirt from using GLib in the past was the desire to catch, report and handle OOM errors. With the switch to aborting on OOM, the only blocker to use of GLib was eliminated.

The decision was thus made for libvirt to adopt the GLib library in the latter part of 2019. From the POV of application developers nothing will change in libvirt. The usage of GLib is purely internal, and so doesn’t leak into the public API exposed from libvirt.so, which remains compatible with what came before. In the case of QEMU/KVM hosts at least, there is also no change in what must be installed on hosts, since GLib was already a dependency of QEMU for many years. This will ultimately be a net win, as using GLib will eliminate other code in libvirt, reducing the installation footprint in aggregate between libvirt and QEMU.

With a large codebase such as libvirt’s, adopting GLib is not as quick as flicking a switch. Some key pieces of libvirt functionality have been ported to use GLib APIs completely, while in other cases the work is going to be an incremental ongoing effort over a long time. This offers plenty of opportunities for new contributors to jump in and make useful changes which are fairly easily understood & straightforward to implement.

Removal of GNULIB

One of the anticipated benefits of using GLib was that it would be able to obsolete a lot of the portability work that GNULIB does. The GNULIB project is strongly entangled with autotools as a build system, so is a blocker to the adoption of a different build system in libvirt. There has thus been an ongoing effort to eliminate GNULIB modules from libvirt code. In many cases, GLib does indeed provide a direct replacement for the functionality needed. One of the surprises, though, is that a very large portion of GNULIB was completely redundant given libvirt’s stated set of OS platform build targets. There is no need to consider portability to a wide variety of old buggy UNIX variants (Solaris, HP-UX, AIX, and so on) for libvirt. After a final big push over the last few weeks, a patch series has been posted which completes the removal of GNULIB from libvirt, which will merge in the 6.1.0 release.

The work has been tested across all the platforms covered by libvirt CI, which means RHEL-7, 8, Fedora 30, 31, rawhide, Ubuntu 16.04, 18.04, Debian 9, 10, sid, FreeBSD 11, 12, macOS 10.14 with Xcode 10.3 and Xcode 11.3, and MinGW64. There are certainly libvirt users on platforms not covered by CI. Those using other modern Linux distros should not see problems if using glibc, as the combination of RHEL, Debian & Ubuntu testing should expose any latent bugs. The more likely places to see regressions will be if people are using libvirt on other *BSDs, or older Linux distros. Usage of alternative C library implementations on Linux is also an unknown, since there is no CI coverage for this. Support for older Linux distros is explicitly not a goal for libvirt and the project will willingly break old platforms. Support for other modern non-Linux OSes, however, is potentially interesting. What is stopping such platforms being considered explicitly by libvirt is the lack of any contributors willing to help provide a CI setup and deal with fixing portability problems. IOW, libvirt is willing to entertain the idea of supporting additional modern OS platforms if contributors want to work with the project to make it happen. The same applies to Linux distros using a non-glibc implementation.

Libvirt: abort() when seeing ENOMEM errors

Posted: January 29th, 2020 | Filed under: Coding Tips, Fedora, libvirt, Virt Tools | Tags: | 2 Comments »

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change of the overall system architecture or some core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary to libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon.

Detecting and reporting OOM

Libvirt has always taken the view that ANY error from a function / system call must be propagated back to the caller. The out of memory condition (ENOMEM / OOM) is just one of many errors that might be seen when calling APIs, and thus libvirt attempted to report this in the normal manner. OOM is not like most other errors though.

The first challenge with OOM is that checking for a NULL return from malloc() is error prone because the return value overloads the error indicator with the normal returned pointer. To address this, the libvirt coding style banned direct use of malloc() and created a wrapper API that returned the allocated pointer in an output parameter, leaving the return value solely as the error indicator, leading to a code pattern like:

  char *varname;

  if (VIR_ALLOC(varname) < 0) {
      ....handle OOM...
  }

This enabled use of the ‘return_check’ function attribute to get compile time validation that allocation errors were checked.

Checking for OOM is only half the problem. Handling OOM is the much more difficult issue. Libvirt uses a ‘goto error’ design pattern for error cleanup code paths. A surprisingly large number of these goto jumps only exist to handle OOM cleanup. Testing these code paths is non-trivial, if not impossible, in the real world. Libvirt integrated a way to force OOM on arbitrary allocations in its unit test suite. This was very successful at finding crashes and memory leaks in OOM handling code paths, but this only validates code that actually has unit test coverage. The number of bugs found in code that was tested for OOM gives very low confidence that other non-tested code would correctly handle OOM. The OOM testing is also incredibly slow to execute since it needs to repeatedly re-run the unit tests, failing a different malloc() each time. The time required grows roughly quadratically with the number of allocations, since each allocation in a test needs its own re-run of that test.
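The wrapper pattern and the forced-OOM testing described above can be sketched together in a few lines. This is only an illustration under stated assumptions, not libvirt’s VIR_ALLOC implementation or its test harness: alloc_ints, build_tables, run_oom_tests and fail_countdown are hypothetical names, and warn_unused_result is the GCC/Clang attribute assumed here to provide the compile time check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Test hook: 0 = never fail; n = make the n-th allocation from now fail. */
static int fail_countdown;

/* The pointer comes back through 'out'; the return value is purely an
 * error indicator, and the compiler warns if a caller ignores it. */
__attribute__((warn_unused_result))
static int alloc_ints(int **out, size_t count)
{
    assert(out != NULL);
    *out = NULL;
    if (fail_countdown > 0 && --fail_countdown == 0)
        return -1;                    /* simulated ENOMEM */
    *out = calloc(count, sizeof(int));
    return *out ? 0 : -1;
}

/* Code under test: two allocations with goto-based cleanup.
 * Returns 0 on success, -1 on (simulated) OOM. */
static int build_tables(void)
{
    int *a = NULL;
    int *b = NULL;
    int ret = -1;

    if (alloc_ints(&a, 16) < 0)
        goto cleanup;
    if (alloc_ints(&b, 32) < 0)
        goto cleanup;

    ret = 0;
 cleanup:
    free(b);
    free(a);
    return ret;
}

/* Re-runs build_tables(), failing a different allocation each time,
 * until a run completes with no injected failure.  Returns the number
 * of failing runs that were needed. */
static int run_oom_tests(void)
{
    int n;

    for (n = 1; ; n++) {
        fail_countdown = n;
        if (build_tables() == 0)
            break;
    }
    fail_countdown = 0;
    return n - 1;
}
```

Each allocation in the function under test costs one extra full run, which is why this style of testing becomes slow as the code grows.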

Assuming the OOM condition is detected and a jump to the error handling path is taken, there is now the problem of getting the error report back to the user. Many of the libvirt drivers run inside the libvirtd daemon, with an RPC system used to send results back to the client application. Reporting the error via RPC messages is quite likely to need memory allocation which may well fail in an OOM scenario.

Is OOM reporting useful?

The paragraphs above describe why reporting OOM scenarios is impractical, verging on impossible, in the real world. Assuming it was possible to report reliably though, would it actually benefit any application using libvirt?

Linux systems generally default to having memory overcommit enabled, and when they run out of memory, the OOM killer will reap some unlucky process. IOW, on Linux, it is very rare for an application to ever see OOM reported from an allocation attempt. Libvirt is ported to non-Linux platforms which may manage memory differently and thus genuinely report OOM from malloc() calls. Those non-Linux users will be taking code paths that are never tested by the majority of libvirt users or developers. This gives low confidence for success.

Although libvirt provides a C library API as its core deliverable, few applications are written in C; most consume libvirt via a language binding, with Python and Go believed to be the most commonly used. Handling OOM in non-C languages is even less practical/common than in C. Many libvirt applications are also already using libraries (GTK, GLib) that will abort on OOM. Overall there is little sign that any libvirt client application attempts to handle OOM in its own code, let alone cares whether libvirt can report it.

One important application process using the libvirt API though is the libvirtd daemon. In the very early days, if libvirtd stopped it would take down all running QEMU VMs, but this limitation was fixed over 10 years ago. To enable software upgrades on hosts with running VMs, libvirtd needs to be able to restart itself. As a result libvirtd maintains a record of important state on disk, enabling it to carry on where it left off when starting up. Recovering from OOM by aborting and allowing libvirtd to be restarted by systemd would align with a code path that already needs to be well tested and supported for software upgrades.

Give up on OOM handling

With all the above in mind, the decision shouldn’t be a surprise. Libvirt has decided to stop attempting to handle ENOMEM from malloc() and related APIs and will instead immediately abort. The libvirtd daemon will automatically restart and carry on where it left off. The result is that the libvirt code can be dramatically simplified by removing many goto jumps and cleanup code blocks, which reduces the maintenance burden on libvirt contributors, allowing more time to be spent on coding features which matter to users.
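In code terms, the abort-on-OOM policy amounts to something like the following minimal sketch. The names xmalloc0 and join_path are hypothetical and are not libvirt’s actual allocation functions; the point is only that callers no longer carry an error branch or cleanup label for allocation failure.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Allocation that never returns NULL: on failure the process aborts and
 * is expected to be restarted by its service manager. */
static void *xmalloc0(size_t size)
{
    void *ptr = calloc(1, size ? size : 1);

    if (ptr == NULL) {
        fputs("out of memory\n", stderr);
        abort();
    }
    return ptr;
}

/* Callers need no error branch and no cleanup label for OOM.
 * The returned string must be released with free(). */
static char *join_path(const char *dir, const char *name)
{
    size_t len;
    char *path;

    assert(dir != NULL && name != NULL);
    len = strlen(dir) + 1 + strlen(name) + 1;
    path = xmalloc0(len);

    snprintf(path, len, "%s/%s", dir, name);
    return path;
}
```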