Libvirt: adoption of GLib library to replace GNULIB & home grown code

Posted: January 30th, 2020 | Filed under: Coding Tips, Fedora, libvirt, Virt Tools

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change to the overall system architecture or to core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary for libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon.

Portability and API abstractions

Libvirt traditionally targeted the POSIX standard API, but there are a number of difficulties with this. Much of POSIX is optional, so cannot be assumed to exist on every platform. Plenty of platforms are non-compliant with the spec, or have different behaviour in scenarios where the spec allowed for multiple interpretations. To deal with this, libvirt used the GNULIB project, a copylib that attempts to fix POSIX non-compliance issues. It is very effective at this, but portability is only one of the problems with using POSIX APIs directly. POSIX is a very low level API, so simple tasks like listening on a TCP socket require many complex API calls. Other APIs have poor designs by modern standards, which make it easy for developers to introduce bugs; the malloc APIs are a particular case in point. As a result libvirt has created many higher level abstractions around the POSIX APIs. In other modern programming languages, though, such higher level abstractions are already a standard offering, allowing developers to focus on solving their application’s domain specific problems. Libvirt maintainers, by contrast, have spent a lot of time developing abstractions unrelated to virtualization, such as object / class systems, DBus client APIs, hash tables / bitmaps, sockets / RPC systems, and much more. This is not a good use of limited resources in the long term.
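
To make the point about API level concrete, here is a hedged, illustrative sketch (not code taken from libvirt) contrasting the raw POSIX calls needed to listen on a TCP port with the single call needed when using a higher level abstraction such as GLib’s GSocketListener; the demo_* function names are invented for this example.

#include <gio/gio.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Raw POSIX: resolve the address, then create, bind and listen by hand */
static int demo_listen_posix(const char *port)
{
    struct addrinfo hints, *res, *tmp;
    int sock = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_PASSIVE;

    if (getaddrinfo(NULL, port, &hints, &res) != 0)
        return -1;

    for (tmp = res; tmp != NULL; tmp = tmp->ai_next) {
        sock = socket(tmp->ai_family, tmp->ai_socktype, tmp->ai_protocol);
        if (sock < 0)
            continue;
        if (bind(sock, tmp->ai_addr, tmp->ai_addrlen) == 0 &&
            listen(sock, SOMAXCONN) == 0)
            break;                  /* listening on the first usable address */
        close(sock);
        sock = -1;
    }
    freeaddrinfo(res);
    return sock;
}

/* GLib: a single call on a higher level abstraction does the same job */
static GSocketListener *demo_listen_glib(guint16 port, GError **err)
{
    GSocketListener *listener = g_socket_listener_new();

    if (!g_socket_listener_add_inet_port(listener, port, NULL, err)) {
        g_object_unref(listener);
        return NULL;
    }
    return listener;
}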

Adoption of GLib

These problems are common to many applications / libraries that are written in C, and thus there are a number of libraries that attempt to provide a high level “standard library”. The GLib library is one such effort from the GNOME project developers that has long been appealing. Some of libvirt’s internal APIs are inspired by those present in GLib, and it has been used by QEMU for a long time too. What prevented libvirt from using GLib in the past was the desire to catch, report and handle OOM errors. With the switch to aborting on OOM, the only blocker to using GLib was eliminated.

The decision was thus made for libvirt to adopt the GLib library in the latter part of 2019. From the POV of application developers nothing will change in libvirt. The usage of GLib is purely internal, and so doesn’t leak into the public API exposed from libvirt.so, which remains compatible with what came before. In the case of QEMU/KVM hosts at least, there is also no change in what must be installed on hosts, since GLib has been a dependency of QEMU for many years. This will ultimately be a net win, as using GLib will eliminate other code in libvirt, reducing the aggregate installation footprint of libvirt and QEMU.

With a large codebase such as libvirt’s, adopting GLib is not as quick as flicking a switch. Some key pieces of libvirt functionality have been ported to use GLib APIs completely, while in other cases the work is going to be an incremental, ongoing effort over a long time. This offers plenty of opportunities for new contributors to jump in and make useful changes which are fairly easily understood & straightforward to implement.
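
As a hedged illustration of the flavour of such conversions (the function below is invented for this example and is not a real libvirt API), the target style leans on GLib’s g_autofree / g_autoptr cleanup attributes and high level helpers like g_file_get_contents(), in place of hand-written open/read/close loops with manual cleanup:

#include <glib.h>

/* Hypothetical helper: report the size of a file's contents. */
static int demo_print_file_size(const char *path)
{
    g_autofree char *content = NULL;     /* freed automatically on return */
    g_autoptr(GError) err = NULL;        /* likewise for the error object */
    gsize len = 0;

    if (!g_file_get_contents(path, &content, &len, &err)) {
        g_printerr("unable to read %s: %s\n", path, err->message);
        return -1;
    }

    g_print("%s is %" G_GSIZE_FORMAT " bytes\n", path, len);
    return 0;
}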

Removal of GNULIB

One of the anticipated benefits of using GLib was that it would obsolete much of the portability work that GNULIB does. The GNULIB project is strongly entangled with autotools as a build system, so it is a blocker to the adoption of a different build system in libvirt. There has thus been an ongoing effort to eliminate GNULIB modules from libvirt code. In many cases, GLib does indeed provide a direct replacement for the functionality needed. One of the surprises, though, is that a very large portion of GNULIB was completely redundant given libvirt’s stated set of OS platform build targets. There is no need to consider portability to a wide variety of old buggy UNIX variants (Solaris, HPUX, AIX, and so on) for libvirt. After a final big push over the last few weeks, a patch series completing the removal of GNULIB from libvirt has been posted, and it will merge in the 6.1.0 release.

The work has been tested across all the platforms covered by libvirt CI, which means RHEL-7, 8, Fedora 30, 31, rawhide, Ubuntu 16.04, 18.04, Debian 9, 10, sid, FreeBSD 11, 12, macOS 10.14 with XCode 10.3 and XCode 11.3, and MinGW64. There are certainly libvirt users on platforms not covered by CI. Those using other modern Linux distros should not see problems if using glibc, as the combination of RHEL, Debian & Ubuntu testing should expose any latent bugs. The more likely places to see regressions will be if people are using libvirt on other *BSDs, or older Linux distros. Usage of alternative C library implementations on Linux is also an unknown, since there is no CI coverage for this. Support for older Linux distros is explicitly not a goal for libvirt and the project will willingly break old platforms. Support for other modern non-Linux OS platforms, however, is potentially interesting. What is stopping such platforms being considered explicitly by libvirt is the lack of any contributors willing to help provide a CI setup and deal with fixing portability problems. IOW, libvirt is willing to entertain the idea of supporting additional modern OS platforms if contributors want to work with the project to make it happen. The same applies to Linux distros using a non-glibc implementation.

Libvirt: abort() when seeing ENOMEM errors

Posted: January 29th, 2020 | Filed under: Coding Tips, Fedora, libvirt, Virt Tools

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change to the overall system architecture or to core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary for libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon.

Detecting and reporting OOM

Libvirt has always taken the view that ANY error from a function / system call must be propagated back to the caller. The out of memory condition (ENOMEM / OOM) is just one of many errors that might be seen when calling APIs, and thus libvirt attempted to report this in the normal manner. OOM is not like most other errors though.

The first challenge with OOM is that checking for a NULL return from malloc() is error prone, because the return value overloads the error indicator with the normal returned pointer. To address this, libvirt’s coding style banned direct use of malloc() and created a wrapper API that returned the allocated pointer in an output parameter, leaving the return value solely as the error indicator, leading to a code pattern like:

  char *varname;

  if (VIR_ALLOC(varname) < 0) {
      ....handle OOM...
  }

This enabled use of the ‘warn_unused_result‘ function attribute to get compile time validation that allocation errors were checked. Checking for OOM is only half the problem though; handling OOM is the much more difficult issue. Libvirt uses a ‘goto error‘ design pattern for error cleanup code paths, and a surprisingly large number of these goto jumps only exist to handle OOM cleanup. Testing these code paths is non-trivial, if not impossible, in the real world. Libvirt integrated a way to force OOM on arbitrary allocations in its unit test suite. This was very successful at finding crashes and memory leaks in OOM handling code paths, but it only validates code that actually has unit test coverage. The number of bugs found in code that was tested for OOM gives very low confidence that other non-tested code would correctly handle OOM. The OOM testing is also incredibly slow to execute, since it needs to repeatedly re-run the unit tests, failing a different malloc() on each run. The time required grows quadratically with the number of allocations.
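
To illustrate the burden, here is a hedged, hypothetical sketch of the ‘goto error‘ pattern, written in the style of the VIR_ALLOC example above; VIR_ALLOC_N and VIR_FREE are assumed to be the array-allocating and free-and-NULL companions of VIR_ALLOC, and the virDemoList type is invented purely for illustration. The error label and both jumps exist only to cope with failed allocations:

/* Hypothetical type, purely to illustrate the cleanup pattern. */
typedef struct {
    int *values;
    size_t nvalues;
} virDemoList;

static virDemoList *virDemoListNew(size_t count)
{
    virDemoList *list = NULL;

    if (VIR_ALLOC(list) < 0)
        goto error;                   /* OOM allocating the struct */

    if (VIR_ALLOC_N(list->values, count) < 0)
        goto error;                   /* OOM allocating the array */

    list->nvalues = count;
    return list;

 error:                               /* exists solely for OOM cleanup */
    if (list)
        VIR_FREE(list->values);
    VIR_FREE(list);
    return NULL;
}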

Assuming the OOM condition is detected and a jump to the error handling path is taken, there is now the problem of getting the error report back to the user. Many of the libvirt drivers run inside the libvirtd daemon, with an RPC system used to send results back to the client application. Reporting the error via RPC messages is quite likely to need memory allocation which may well fail in an OOM scenario.

Is OOM reporting useful?

The paragraphs above describe why reporting OOM scenarios is impractical, verging on impossible, in the real world. Assuming it were possible to report it reliably though, would it actually benefit any application using libvirt?

Linux systems generally default to having memory overcommit enabled, and when they run out of memory, the OOM killer will reap some unlucky process. IOW, on Linux, it is very rare for an application to ever see OOM reported from an allocation attempt. Libvirt is ported to non-Linux platforms which may manage memory differently and thus genuinely report OOM from malloc() calls. Those non-Linux users will be taking code paths that are never tested by the majority of libvirt users or developers. This gives low confidence for success.

Although libvirt provides a C library API as its core deliverable, few applications are written in C; most consume libvirt via a language binding, with Perl and Go believed to be the most commonly used. Handling OOM in non-C languages is even less practical / common than in C. Many libvirt applications are also already using libraries (GTK, GLib) that will abort on OOM. Overall there is little sign that any libvirt client application attempts to handle OOM in its own code, let alone cares whether libvirt can report it.

One important application process using the libvirt API, though, is the libvirtd daemon. In the very early days, if libvirtd stopped it would take down all running QEMU VMs, but this limitation was fixed over 10 years ago. To enable software upgrades on hosts with running VMs, libvirtd needs to be able to restart itself. As a result libvirtd maintains a record of important state on disk, enabling it to carry on where it left off when starting up. Recovering from OOM by aborting and allowing libvirtd to be restarted by systemd would align with a code path that already needs to be well tested and supported for software upgrades.

Give up on OOM handling

With all the above in mind, the decision shouldn’t be a surprise. Libvirt has decided to stop attempting to handle ENOMEM from malloc() and related APIs and will instead immediately abort. The libvirtd daemon will automatically restart and carry on where it left off. The result is that the libvirt code can be dramatically simplified by removing many goto jumps and cleanup code blocks, which reduces the maintenance burden on libvirt contributors, allowing more time to be spent on coding features which matter to users.
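
For comparison, here is a hedged sketch of what the same hypothetical constructor from the earlier example looks like once allocation failures simply abort: GLib’s g_new0() never returns NULL, so the error label and its cleanup code disappear entirely.

#include <glib.h>

typedef struct {
    int *values;
    size_t nvalues;
} virDemoList;

static virDemoList *virDemoListNew(size_t count)
{
    /* g_new0() aborts the process on OOM, so it cannot return NULL
     * and no error cleanup path is needed. */
    virDemoList *list = g_new0(virDemoList, 1);

    list->values = g_new0(int, count);
    list->nvalues = count;
    return list;
}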


ANNOUNCE: libvirt-glib release 3.0.0

Posted: November 26th, 2019 | Filed under: Coding Tips, Fedora, libvirt, Virt Tools

I am pleased to announce that a new release of the libvirt-glib package, version 3.0.0, is now available from

https://libvirt.org/sources/glib/

The packages are GPG signed with

Key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

Changes in this release:

  • Add support for bochs video device
  • Add API to query firmware config
  • Improve testing coverage
  • Validate min/max glib API versions in use
  • Remove deprecated G_PARAM_PRIVATE
  • Fix docs build linking problems
  • Convert python demos to be python 3 compatible & use modern best practice for pyobject introspection bindings
  • Add API to query domain capabilities
  • Refresh translations
  • Simplify build process for handling translations
  • Fix some memory leaks
  • Add API for setting storage volume features

Thanks to everyone who contributed to this new release through testing, patches, bug reports, translations and more.

Easier QEMU live tracing using systemtap

Posted: February 18th, 2019 | Filed under: Coding Tips, Fedora, Virt Tools

QEMU is able to leverage a number of live tracing systems, with the choice configurable at build time between:

  • log – printf formatted string for each event sent into QEMU’s logging system which writes to stderr
  • syslog – printf formatted string for each event sent via syslog
  • simple – binary data stream for each event written to a file or fifo pipe
  • ftrace – printf formatted string for each event sent to kernel ftrace facility
  • dtrace – user space probe markers dynamically enabled via dtrace or systemtap
  • ust – user space probe markers dynamically enabled via LTTng

Upstream QEMU enables the “log” trace backend by default, since it is cross-platform portable and very simple to use by adding “-d trace:PATTERN” on the QEMU command line. For example, to enable logging of all trace events in the QEMU I/O subsystem (aka “qio“) we can run:

$ qemu -d trace:qio* ...some args...
23266@1547735759.137292:qio_channel_socket_new Socket new ioc=0x563a8a39d400
23266@1547735759.137305:qio_task_new Task new task=0x563a891d0570 source=0x563a8a39d400 func=0x563a86f1e6c0 opaque=0x563a89078000
23266@1547735759.137326:qio_task_thread_start Task thread start task=0x563a891d0570 worker=0x563a86f1ce50 opaque=0x563a891d9d90
23273@1547735759.137491:qio_task_thread_run Task thread run task=0x563a891d0570
23273@1547735759.137503:qio_channel_socket_connect_sync Socket connect sync ioc=0x563a8a39d400 addr=0x563a891d9d90
23273@1547735759.138108:qio_channel_socket_connect_fail Socket connect fail ioc=0x563a8a39d400

This is very simple and surprisingly effective much of the time, but it is not without its downsides:

  • Inactive probes have a non-negligible performance impact on hot codepaths
  • It is targeted at human consumption, so it is not easy to process reliably by machine
  • It requires adding arguments to QEMU’s command line, so is not easy to enable in many cases
  • It is specific to QEMU, so does not facilitate getting correlated traces across the whole system

For these reasons, some downstreams chose not to use the default “log” backend. Both Fedora and RHEL have instead enabled the “dtrace” backend, which can be leveraged via systemtap on Linux. This provides a very powerful tracing system, but the cost is that the previously simple task of printing a formatted string when a probe point fires has become MUCH more complicated. For example, getting output equivalent to that seen with QEMU’s log backend would require:

# cat > trace.stp <<EOF
probe qemu.system.x86_64.qio_task_new {
    printf("%d@%d qio_task_new Task new task=%p source=%p func=%p opaque=%p\n", 
           pid(), gettimeofday_ns(), task, source, func, opaque)
}
EOF
# stap trace.stp
22806@1547735341399862570 qio_task_new Task new task=0x56135cd66eb0 source=0x56135d1d7c00 func=0x56135af746c0 opaque=0x56135bf06400

Repeat that code snippet for every qio* probe point you want to watch, figuring out the set of args it has available to print. This quickly becomes tedious for what should be a simple logging job, especially if you need to reference null-terminated strings from userspace.

After cursing this difficulty one time too many, it occurred to me that QEMU could easily do more to make life easier for systemtap users. The QEMU build system is already auto-generating all the trace backend specific code from a generic description of probes in the QEMU source tree. It has a format string which is used in the syslog, log and ftrace backends, but this is ignored for the dtrace backend. It did not take much to change the code generator so that it can use this format string to generate a convenient systemtap tapset representing the above manually written probe:

probe qemu.system.x86_64.log.qio_task_new = qemu.system.x86_64.qio_task_new ?
{
    printf("%d@%d qio_task_new Task new task=%p source=%p func=%p opaque=%p\n",
           pid(), gettimeofday_ns(), task, source, func, opaque)
}

This can be trivially executed, with only minimal knowledge of the systemtap tapset language required:

# stap -e "qemu.system.x86_64.log.qio_task_new{}"
22806@1547735341399862570 qio_task_new Task new task=0x56135cd66eb0 source=0x56135d1d7c00 func=0x56135af746c0 opaque=0x56135bf06400

Even better, we have now gained the ability to use wildcards too:

# stap -e "qemu.system.x86_64.log.qio*{}"
23266@1547735759.137292:qio_channel_socket_new Socket new ioc=0x563a8a39d400
23266@1547735759.137305:qio_task_new Task new task=0x563a891d0570 source=0x563a8a39d400 func=0x563a86f1e6c0 opaque=0x563a89078000
23266@1547735759.137326:qio_task_thread_start Task thread start task=0x563a891d0570 worker=0x563a86f1ce50 opaque=0x563a891d9d90
23273@1547735759.137491:qio_task_thread_run Task thread run task=0x563a891d0570
23273@1547735759.137503:qio_channel_socket_connect_sync Socket connect sync ioc=0x563a8a39d400 addr=0x563a891d9d90
23273@1547735759.138108:qio_channel_socket_connect_fail Socket connect fail ioc=0x563a8a39d400

Users still need to be aware of the naming convention for QEMU’s systemtap tapsets, how it maps to the particular QEMU binary being used, and must remember the trailing “{}”. Thus I decided to go one step further and ship a small helper tool to make it even easier to use:

$ qemu-trace-stap run qemu-system-x86_64 'qio*'
22806@1547735341399856820 qio_channel_socket_new Socket new ioc=0x56135d1d7c00
22806@1547735341399862570 qio_task_new Task new task=0x56135cd66eb0 source=0x56135d1d7c00 func=0x56135af746c0 opaque=0x56135bf06400
22806@1547735341399865943 qio_task_thread_start Task thread start task=0x56135cd66eb0 worker=0x56135af72e50 opaque=0x56135c071d70
22806@1547735341399976816 qio_task_thread_run Task thread run task=0x56135cd66eb0

The second argument to this tool is the QEMU binary filename to be traced, which can be relative (to search $PATH) or absolute. What is clever is that it will set the SYSTEMTAP_TAPSET env variable to point to the right location to find the corresponding tapset definition. This is very useful when you have multiple copies of QEMU on the system and need to make sure systemtap traces the right one.

The ‘qemu-trace-stap‘ script takes a verbose arg so you can understand what it is running behind the scenes:

$ qemu-trace-stap run /home/berrange/usr/qemu-git/bin/qemu-system-x86_64 'qio*'
Using tapset dir '/home/berrange/usr/qemu-git/share/systemtap/tapset' for binary '/home/berrange/usr/qemu-git/bin/qemu-system-x86_64'
Compiling script 'probe qemu.system.x86_64.log.qio* {}'
Running script, <Ctrl>-c to quit
...trace output...

It can enable multiple probes at once:

$ qemu-trace-stap run qemu-system-x86_64 'qio*' 'qcrypto*' 'buffer*'

By default it monitors all existing running processes and all future launched processes. This can be restricted to a specific PID using the --pid arg:

$ qemu-trace-stap run --pid 2532 qemu-system-x86_64 'qio*'

Finally, if you can’t remember which probes are valid, it can tell you:

$ qemu-trace-stap list qemu-system-x86_64
ahci_check_irq
ahci_cmd_done
ahci_dma_prepare_buf
ahci_dma_prepare_buf_fail
ahci_dma_rw_buf
ahci_irq_lower
...snip...

This new functionality merged into QEMU upstream a short while ago and will be included in the QEMU 4.0 release coming at the end of April.

Improved translation po file handling by ditching gettext autotools integration

Posted: November 29th, 2018 | Filed under: Coding Tips, Fedora, libvirt, Virt Tools

The libvirt library has long provided translations of its end user facing strings, which largely means error messages and console output from command line tools / daemons. Since libvirt uses autotools for its build system, it naturally used the standard automake integration provided by gettext for handling .po files. The libvirt.pot file with master strings is exported to Zanata, where the actual translation work is outsourced to the Fedora translation team who support up to ~100 languages. At time of writing libvirt has some level of translation in ~45 languages.

With use of Zanata, libvirt must periodically create an updated libvirt.pot file and push it to Zanata, and then just before each release it must pull the latest translated .po files back into GIT.

There have been a number of problems with this approach which have been annoying us pretty much since the start, and earlier this year it finally became too much to bear any longer.

  • The per-language translation files stored in git contain source file name and line number annotations to indicate where each translatable string originates. Since the translation files are not re-generated on every source file change, the file location annotations become increasingly out of date after every commit. When the translation files are updated, 98% of the diff is simply changing source file locations, leading to a very poor signal/noise ratio.
  • The strings in the per-language translation files are sorted according to source filename. Thus when code is moved between files, or when files are renamed, the strings in the updated translation files all get needlessly reordered, again leading to a poor signal/noise ratio in diffs.
  • Each language translation file contains every translatable string, even those which do not have any translation yet. This makes sense if translators are working directly against the .po files, but in libvirt everything is done via the Zanata UI, which already knows the list of untranslated strings.
  • The per-language translation files grow in size over time with previously used message strings appended to the end of the file, never discarded by the gettext tools. This again makes sense if translators are working directly against .po files, but Zanata already provides a full translation memory containing historically used strings.
  • Whenever ‘make dist’ is run, the gettext autotools integration will regenerate the per-language translation files. As a result of the three previous points, every time a release is made there’s a giant commit, more than 100MB in size, that contains diffs for translated files which are entirely noise and no signal.

One suggested approach to deal with this is to stop storing translations in GIT at all and simply export them from Zanata only at the time of ‘make dist’. The concern with this approach is that the GIT repository would no longer contain the full source for the project in a self-contained manner. ‘make dist‘ would now need a live network connection to the Zanata servers. If we were to replace Zanata with a new tool in the future (Zanata is already a replacement for the previously used Transifex), we would potentially lose access to translations for old releases.

With this in mind we decided to optimize the way translations are managed in GIT.

The first easy win was to simply remove the master libvirt.pot file from GIT entirely. This file is auto-generated from the source files and is out of date the moment any source file changes, so no one would ever want to use the stored copy.

The second, more complex, step was to minimize and canonicalize the per-language translation files. msgmerge is used to take the full .po file, strip out the source file locations and sort the strings alphabetically. A perl script is then used to further process the content, dropping any translations marked as “fuzzy” and any strings for which there is no translated text available. The resulting output still uses the normal .po file format, but we call these ‘.mini.po‘ files to indicate that they are stripped down compared to what you’d normally expect to see.

The final step was to remove the gettext / autotools integration and write a custom Makefile.am to handle the key tasks.

  • A target ‘update-mini-po‘ to automate the process of converting full .po files into .mini.po files. This is used when pulling down new translations from Zanata to be stored in git before release.
  • A target ‘update-po’ to automate the inverse process of converting .mini.po files back into full .po files. This is to be used by anyone who might need to look at full language translations outside of Zanata.
  • An install hook to generate the binary .gmo files from the .mini.po files and install them into /usr/share/locale for use at runtime. This avoids the need to ship the full .po files in release tarballs.
  • A target ‘zanata-push‘ to automate the process of re-generating the libvirt.pot file and uploading it to Zanata.
  • A target ‘zanata-pull‘ to automate the process of pulling new translations down from Zanata and then triggering ‘update-mini-po‘.

After all this work was completed the key benefits are

  • The size of content stored in GIT was reduced from ~100MB to ~18MB.
  • Updates to the translations in GIT now produce small diffstats with a high signal/noise ratio
  • Files stored in GIT are never changed as a side effect of build system commands like ‘make dist’
  • The autotools integration is easier to understand

while not having any visible change for the translators using Zanata. In the event anyone does need to see full language translations outside of Zanata, there is an extra step to generate the full .po files from the .mini.po files, but this is countered by the fact that the result will be fully up to date with respect to translatable strings and source file locations.

I’d encourage any project which is using gettext autotools integration, while also outsourcing to a system like Zanata, to consider whether they’d benefit from taking similar steps to libvirt. Not all projects will get the same degree of space saving but diffstats with good signal/noise ratios and removing side effects from ‘make dist’ are wins that are likely desirable for any project.