Improving QEMU security part 5: TLS support for NBD server & client

Posted: April 5th, 2016 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Security, Virt Tools | Tags: , , , | No Comments »

This blog is part 5 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

For many years now QEMU has had code to support the NBD protocol, either as a client or as a server. The qemu-nbd command line tool can be used to export a disk image over NBD to a remote machine, or connect it directly to the local kernel’s NBD block device driver. The QEMU system emulators also have a block driver that acts as an NBD client, allowing VMs to be run from NBD volumes. More recently the QEMU system emulators gained the ability to export the disks from a running VM as named NBD volumes. The latter is particularly interesting because it is the foundation of live migration with block device replication, allowing VMs to be migrated even if you don’t have shared storage between the two hosts. In common with most network block device protocols, NBD has never offered any kind of data security capability. Administrators are recommended to run NBD over a private LAN/vLAN, use network layer security like IPSec, or tunnel it over some other kind of secure channel. While all these options are capable of working, none are very convenient to use because they require extra setup steps outside of the basic operation of the NBD server/clients. Libvirt has long had the ability to tunnel the QEMU migration channel over its own secure connection to the target host, but this has not been extended to cover the NBD channel(s) opened when doing block migration. While it could theoretically be extended to cover NBD, it would not be ideal from a performance POV because the libvirtd architecture means that the TLS encryption/decryption for multiple separate network connections would be handled by a single thread. For fast networks (10-GigE), libvirt will quickly become the bottleneck on performance even if the CPU has native support for AES.

Thus it was decided that the QEMU NBD client & server would need to be extended to support TLS encryption of the data channel natively. Initially the thought was to just add a flag to the client/server code to indicate that TLS was desired and run the TLS handshake before even starting the NBD protocol. After some discussion with the NBD maintainers though, it was decided to explicitly define a way to support TLS in the NBD protocol negotiation phase. The primary benefit of doing this is to allow clearer error reporting to the user if the client connects to a server requiring use of TLS and the client itself does not support TLS, or vica-verca – ie instead of just seeing what appears to be a mangled NBD handshake and not knowing what it means, the client can clearly report “This NBD server requires use of TLS encryption”.

The extension to the NBD protocol was fairly straightforward. After the initial NBD greeting (where the client & server agree the NBD protocol variant to be used) the client is able to request a number of protocol options. A new option was defined to allow the client to request TLS support. If the server agrees to use TLS, then they perform a standard TLS handshake and the rest of the NBD protocol carries on as normal. To prevent downgrade attacks, if the NBD server requires TLS and the client does not request the TLS option, then it will respond with an error and drop the client. In addition if the server requires TLS, then TLS must be the first option that the client requests – other options are only permitted once the TLS session is active & the server will again drop the client if it tries to request non-TLS options first.

The QEMU NBD implementation was originally using plain POSIX sockets APIs for all its I/O. So the first step in enabling TLS was to update the NBD code so that it used the new general purpose QEMU I/O channel  APIs instead. With that done it was simply a matter of instantiating a new QIOChannelTLS object at the correct part of the protocol handshake and adding various command line options to the QEMU system emulator and qemu-nbd program to allow the user to turn on TLS and configure x509 certificates.

Running a NBD server using TLS can be done as follows:

$ qemu-nbd --object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
           --tls-creds tls0 /path/to/disk/image.qcow2

On the client host, a QEMU guest can then be launched, connecting to this NBD server:

$ qemu-system-x86_64 -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/qemutls \
                     -drive driver=nbd,host=theotherhost,port=10809,tls-creds=tls0 \
                     ...other QEMU options...

Finally to enable support for live migration with block device replication, the QEMU system monitor APIs gained support for a new parameter when starting the internal NBD server. All of this code was merged in time for the forthcoming QEMU 2.6 release. Work has not yet started to enable TLS with NBD in libvirt, as there is little point securing the NBD protocol streams, until the primary live migration stream is using TLS. More on live migration in a future blog post, as that’s going to be QEMU 2.7 material now.

In this blog series:

Writing the Nova file injection code to use libguestfs APIs instead of FUSE

Posted: November 15th, 2012 | Filed under: Fedora, libvirt, OpenStack, Security, Virt Tools | Tags: , , , , , | No Comments »

When launching a virtual machine, Nova has the ability to inject various files into the disk image immediately prior to boot up. This is used to perform the following setup operations:

  • Add an authorized SSH key for the root account
  • Configure init to reset SELinux labelling on /root/.ssh
  • Set the login password for the root account
  • Copy data into a number of user specified files
  • Create the meta.js file
  • Configure network interfaces in the guest

This file injection is handled by the code in the nova.virt.disk.api module. The code which does the actual injection is designed around the assumption that the filesystem in the guest image can be mapped into a location in the host filesystem. There are a number of ways this can be done, so Nova has a pluggable API for mounting guest images in the host, defined by the nova.virt.disk.mount module, with the following implementations:

  • Loop – Use losetup to create a loop device. Then use kpartx to map the partitions within the device, and finally mount the designated partition. Alternatively on new enough kernels the loop device’s builtin partition support is used instead of kpartx.
  • NBD – Use qemu-nbd to run a NBD server and attach with the kernel NBD client to expose a device. Then mapping partitions is handled as per Loop module
  • GuestFS – Use libguestfs to inspect the image and setup a FUSE mount for all partitions or logical volumes inside the image.

The Loop module can only handle Raw format files, while the NBD module can handle any format that QEMU supports. While they have the ability to access partitions, the code handling this is very dumb. It requires the Nova global ‘libvirt_inject_partition’ config parameter to specify which partition number to inject. The result is that every image you upload to glance must be partitioned in exactly the same way. Much better would be if it used a metadata parameter associated with the image. The GuestFS module is much more advanced and inspects the guest OS to figure out arbitrarily partitioned images and even LVM based images.

Nova has a “img_handlers” configuration parameter which defines the order in which the 3 mount modules above are to be tried. It tries to mount the image with each one in turn, until one suceeds. This is quite crude code really – it has already been hacked to avoid trying the Loop module if Nova knows it is using QCow2. It has to be changed by the Nova admin if they’re using LXC, otherwise you can end up using KVM with LXC guests which is probably not what you want. The try-and-fallback paradigm also has the undesirable behaviour of masking errors that you would really rather consider fatal to the boot process.

As mentioned earlier, the file injection code uses the mount modules to map the guest image filesystem into a temporary directory in the host (such as /tmp/openstack-XXXXXX). It then runs various commands like chmod, chown, mkdir, tee, etc to manipulate files in the guest image. Of course Nova runs as an unprivileged user, and the guest files to be changed are typically owned as root. This means all the file injection commands need to run via Nova’s rootwrap utility to gain root privileges. Needless to say, this has the undesirable consequence that the code injecting files into a guest image in fact has privileges that allow it to write to arbitrary areas of the host filesystem. One mistake in handling symlinks and you have the potential for a carefully crafted guest image to result in compromise of the host OS. It should come as little surprise that this has already resulted in a security vulnerability / CVE against Nova.

The solution to this class of security problems is to decouple the file injection code from the host filesystem. This can be done by introducing a “VFS” (Virtual File System) interface which defines a formal API for the various logical operations that need to be performed on a guest filesystem. With that it is possible to provide an implementation that uses the libguestfs native python API, rather than FUSE mounts. As well as being inherently more secure, avoiding the FUSE layer will improve performance, and allow Nova to utilize libguestfs APIs that don’t map into FUSE, such as its Augeas support for parsing config files. Nova still needs to work in scenarios where libguestfs is not available though, so a second implementation of the VFS APIs will be required based on the existing Loop/Nbd device mount approach. The security of the non-libguestfs support has not changed with this refactoring work, but de-coupling the file injection code from the host filesystem does make it easier to write unit tests for this code. The file injection code can be tested by mocking out the VFS layer, while the VFS implementations can be tested by mocking out the libguestfs or command execution APIs.

Incidentally if you’re wondering why Libguestfs does all its work inside a KVM appliance, its man page describes the security issues this approach protects against vs just directly mounting guest images on the host