Improving QEMU security part 6: TLS support for character devices

Posted: August 16th, 2016 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Security, Virt Tools

This blog is part 6 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

A number of QEMU device models and objects use character devices to provide connectivity with the outside world, including the QEMU monitor, serial ports, parallel ports, virtio serial channels, the RNG EGD object, CCID smartcard passthrough, the IPMI device, USB device redirection and vhost-user. While some of these will only ever need a character device configured with local connectivity, some will certainly need to make use of TCP connections to remote hosts. Historically these connections have always been entirely in clear text, which is unacceptable in the modern hostile network environment where even internal networks cannot be trusted. Clearly the QEMU character device code requires the ability to use TLS for encrypting sensitive data and providing some level of authentication on connections.

The QEMU character device code was mostly using GLib’s GIOChannel framework for doing I/O, but this has a number of unsatisfactory limitations. It cannot do vectored I/O, is not easily extensible and does not concern itself at all with initial connection establishment. These are all reasons why the QIOChannel framework was added to QEMU. So the first step in supporting TLS on character devices was to convert the code over to use QIOChannel instead of GIOChannel. With that done, adding in support for TLS was quite straightforward, merely requiring addition of a new configuration property (“tls-creds”) to set the desired TLS credentials.

For example, to run a QEMU VM with a serial port listening on IP 10.0.0.1, port 9000, acting as a TLS server:

$ qemu-system-x86_64 \
      -object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
      -chardev socket,id=s0,host=10.0.0.1,port=9000,tls-creds=tls0,server \
      -device isa-serial,chardev=s0
      ...other QEMU options...
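
The directory referenced by “dir=” is expected to contain the CA certificate plus the server’s certificate and private key, named ca-cert.pem, server-cert.pem and server-key.pem (a client endpoint instead looks for client-cert.pem and client-key.pem). As a rough sketch, a self-signed set can be created with the GnuTLS certtool – the template contents below are illustrative assumptions only:

$ cd /home/berrange/qemutls
# create a CA key and self-signed CA certificate
$ certtool --generate-privkey > ca-key.pem
$ printf 'cn = Example CA\nca\ncert_signing_key\n' > ca.info
$ certtool --generate-self-signed --load-privkey ca-key.pem \
           --template ca.info --outfile ca-cert.pem
# create a server key and a certificate signed by the CA
$ certtool --generate-privkey > server-key.pem
$ printf 'organization = Example Org\ncn = 10.0.0.1\ntls_www_server\nencryption_key\nsigning_key\n' > server.info
$ certtool --generate-certificate --load-privkey server-key.pem \
           --load-ca-certificate ca-cert.pem --load-ca-privkey ca-key.pem \
           --template server.info --outfile server-cert.pem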

It is possible to test connectivity to this TLS server using the gnutls-cli tool:

$ gnutls-cli --priority=NORMAL -p 9000 \
      --x509cafile=/home/berrange/qemutls/ca-cert.pem \
      10.0.0.1

In the above example, QEMU was running as a TCP server and acting as the TLS server endpoint, but these two roles do not have to match. It is valid to configure the TCP server to act as a TLS client if desired, though this would be somewhat uncommon.

Of course you can connect two QEMU VMs together, both using TLS. Assuming the above QEMU is still running, we can launch a second QEMU connecting to it with:

$ qemu-system-x86_64 \
      -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/qemutls \
      -chardev socket,id=s0,host=10.0.0.1,port=9000,tls-creds=tls0 \
      -device isa-serial,chardev=s0
      ...other QEMU options...

Notice, we’ve changed the “endpoint” and removed the “server” option, so this second QEMU runs as a TCP client and acts as the TLS client endpoint.

This feature is available since the QEMU 2.6.0 release a few months ago.
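
The same “tls-creds” property can be attached to any of the other character device users listed earlier. For instance, here is a rough, untested sketch (reusing the tls0 credentials object from above) of exposing a QMP monitor socket protected by TLS:

$ qemu-system-x86_64 \
      -object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
      -chardev socket,id=mon0,host=10.0.0.1,port=9001,server,nowait,tls-creds=tls0 \
      -mon chardev=mon0,mode=control \
      ...other QEMU options...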


ANNOUNCE: libvirt switch to time based rules for updating version numbers

Posted: June 14th, 2016 | Filed under: Fedora, libvirt, OpenStack

Until today, libvirt has used a 3 digit version number for monthly releases off the git master branch, and a 4 digit version number for maintenance releases off stable branches. Henceforth all releases will use 3 digits, and the next release will be 2.0.0, followed by 2.1.0, 2.2.0, etc, with stable releases incrementing the last digit (2.0.1, 2.0.2, etc) instead of appending yet another digit.

For the longer explanation read on…

We have the following rules about when we increment each digit in the version number:

  • major: no one has any clue about when we should bump this
  • minor: bump this when some “significant”[*] features appear
  • micro: bump this on each new master branch release
  • extra: bump this for stable branch releases

[*] for a definition of “significant” that either no one knows, or that we invent post-update to justify why we changed the digit.

Now consider the actual requirements libvirt has

  • A number that increments on each release from master branches
  • A number that can be further incremented for stable branch releases without clashing with future master branch releases

The micro + extra digits alone deal with our two actual requirements, so one may ask what is the point of the major + minor digits in the version number?

In 11 years of libvirt development we’ve only bumped the major digit once, and we didn’t have any real reason why we chose to bump the major digit instead of continuing to bump the minor digit. It just felt like we ought to have a 1.0 release after 7+ years. Our decisions about when to bump the minor digit have not been that much less arbitrary. We just look at what features are around and randomly decide if any feel “big enough” to justify a minor digit bump.

Way back in the early days of libvirt, we had exactly this kind of mess when deciding when to actually make releases. Sometimes we’d release after a month, sometimes after 3 months, completely arbitrarily based on whether the accumulated changes felt “big enough” to justify a release. Feature based release schedules are insanity as no one can predict when the next one might happen. Fortunately we wised up pretty quickly and adopted a time based release schedule where we release monthly, approximately on the 1st. The only exception is over the xmas/new year period, where we avoid Jan 1st and Feb 1st releases and instead have a Jan 15th release, giving a 6 week gap. There is no stated semantic difference between any of our releases off the git master branch – they just include whatever happens to be ready at the time.

Considering version numbers again, it is clear that the reasons why a feature based release timeline is a bad idea are just as applicable to feature based version numbering rules. So we have decided to switch to a time based rule for incrementing the version number. Note that this is *not* to be confused with switching to a time based version number. We want individual digits in the version number to be completely devoid of any semantics. Just as we don’t want version number changes to imply a particular level of feature changes, we also don’t want version numbers to correspond to dates of releases. IOW, we are *not* using the year and month to form the version number, rather we are using the change in year and change in month as a trigger to update the version number. So our new version number rules are:

  • major: bumped for the first release of each year
  • minor: bumped for each new release from the master branch
  • micro: bumped for every stable branch release

Rather than wait until January 2017 to put this new rule into effect, we are pretending that July is January, so the next libvirt release will bump the major version number to 2.0.0. Thereafter the releases will be 2.1.0, 2.2.0, etc, until January 2017, when we’ll go to 3.0.0. The maintenance releases based off 2.0.0 will be 2.0.1, 2.0.2, 2.0.3, etc, and live on a v2.0-maint branch in git.

So henceforth you should not interpret the libvirt version numbers as having any semantic meaning. They are merely indicating the progression of releases.

As a reminder, libvirt promises API and ABI stability forever, and the ELF library soname version number is thus fixed forever at libvirt.so.0, regardless of what version number a release has.
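
In practice this just means the release version reported for the library will move on (2.0.0, 2.1.0, …) while the ELF soname does not. A quick illustration (the library path is whatever your distro installs):

# report the installed libvirt release version
$ pkg-config --modversion libvirt
# the soname stays at libvirt.so.0 regardless of the release version
$ readelf -d /usr/lib64/libvirt.so | grep SONAME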

Analysis of techniques for ensuring migration completion with KVM

Posted: May 12th, 2016 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Virt Tools

Live migration is a long standing feature in QEMU/KVM (and other competing virtualization platforms), however, by default it does not cope very well with guests whose workloads are very memory write intensive. It is very easy to create a guest workload that will ensure a migration never completes in its default configuration. For example, a guest which continually writes to each byte in a 1 GB region of RAM will never successfully migrate over a 1Gb/sec NIC. Even with a 10Gb/s NIC, a slightly larger guest can dirty memory fast enough to prevent completion without an unacceptably large downtime at switchover. Thus over the years, a number of optional features have been developed for QEMU with the aim of helping migration to complete.

If you don’t want to read the background information on migration features and the testing harness, skip right to the end where there is a set of data tables linking to charts of the results, followed by analysis of what this all means.

The techniques available

  • Downtime tuning. Unless the guest is completely idle, it is never possible to get to a point where 100% of memory has been transferred to the target host. So at some point there needs to be a decision made about whether enough memory has been transferred to allow the switch over to the target host with an acceptable blackout period. The downtime tunable controls how long a blackout period is permitted during the switchover. QEMU measures the network transfer rate it is achieving and compares it to the amount of outstanding RAM to determine if it can be transferred within the configured downtime window. When migrating it is not desirable to set QEMU to use the maximum accepted downtime straightaway, as that guarantees that the guest will always suffer from the maximum downtime blackout. Instead, it is better to start off with a fairly small downtime value and increase the permitted downtime as time passes. The idea is to maximise the likelihood that migration can complete with a small downtime (see the monitor command sketch after this list for how this and the other tunables are typically set).
  • Bandwidth tuning. If the migration is taking place over a NIC that is used for other non-migration related actions, it may be desirable to prevent the migration stream from consuming all bandwidth. As noted earlier though, even a relatively small guest is capable of dirtying RAM fast enough that even a 10Gbs NIC will not be able to complete migration. Thus if the goal is to maximise the chances of getting a successful migration, the aim should be to maximise the network bandwidth available to the migration operation. Following on from this, it is wise not to try to run multiple migration operations in parallel unless their transfer rates show that they are not maxing out the available bandwidth, as running parallel migrations may well mean neither will ever finish.
  • Pausing CPUs. The simplest and crudest mechanism for ensuring guest migration completes is to simply pause the guest CPUs. This prevents the guest from continuing to dirty memory and thus, even on the slowest network, it will ensure migration completes in a finite amount of time. The cost is that the guest workload will be completely stopped for a prolonged period of time. Think of pausing the guest as being equivalent to setting an arbitrarily long maximum permitted downtime. For example, assuming a guest with 8 GB of RAM and an idle 10Gbs NIC, in the worst case pausing would lead to an approx 6 second period of downtime. If higher speed NICs are available, the impact of pausing will decrease until it converges with a typical max downtime setting.
  • Auto-convergence. The rate at which a guest can dirty memory is related to the amount of time the guest CPUs are permitted to run for. Thus by throttling the CPU execution time it is possible to prevent the guest from dirtying memory so quickly and thus allow migration data transfer to keep ahead of RAM dirtying. If this feature is enabled, by default QEMU starts by cutting 20% of the guest vCPU execution time. At the start of each iteration over RAM, it will check progress during the previous two iterations. If insufficient forward progress is being made, it will repeatedly cut a further 10% off the running time allowed to vCPUs. QEMU will throttle CPUs all the way to 99%. This should guarantee that migration can complete on all but the most sluggish networks, but has a pretty high cost to guest CPU performance. It is also indiscriminate in that all guest vCPUs are throttled by the same factor, even if only one guest process is responsible for the memory dirtying.
  • Post-copy. Normally migration will only switch over to running on the target host once all RAM has been transferred. With post-copy, the goal is to transfer “enough” or “most” RAM across and then switch over to running on the target. When the target QEMU gets a fault for a memory page that has not yet been transferred, it’ll make an explicit out of band request for that page from the source QEMU. Since it is possible to switch to post-copy mode at any time, it avoids the entire problem of having to complete migration in a fixed downtime window. The cost is that while running in post-copy mode, guest page faults can be quite expensive, since there is a need to wait for the source host to transfer the memory page over to the target, which impacts performance of the guest during the post-copy phase. If there is a network interruption while in post-copy mode it will also be impossible to recover. Since neither the source nor the target host has a complete view of the guest RAM it will be necessary to reboot the guest.
  • Compression. The migrated pages are usually transferred to the target host as-is. For many guest workloads, memory page contents will be fairly easily compressible. So if there are available CPU cycles on the source host and the network bandwidth is a limiting factor, it may be worthwhile burning source CPUs in order to compress data transferred over the network. Depending on the level of compression achieved it may allow migration to complete. If the memory is not compression friendly though, it would be burning CPU cycles for no benefit. QEMU supports two compression methods, XBZRLE and multi-thread, either of which can be enabled. With XBZRLE a cache of previously sent memory pages is maintained that is sized to be some percentage of guest RAM. When a page is dirtied by the guest, QEMU compares the new page contents to that in the cache and then only sends a delta of the changes rather than the entire page. For this to be effective the cache size must generally be quite large – 50% of guest RAM would not be unreasonable. The alternative compression approach uses multiple threads which simply use zlib to directly compress the full RAM pages. This avoids the need to maintain a large cache of previous RAM pages, but is much more CPU intensive unless hardware acceleration is available for the zlib compression algorithm.
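
As a concrete reference for the tunables described above, here is a rough sketch of poking them from the host, using virsh to pass HMP commands through to the QEMU monitor. The domain name “demo” is a placeholder, and the exact capability/parameter names should be checked against the QEMU version in use:

# permitted downtime (seconds) and migration bandwidth cap
$ virsh qemu-monitor-command demo --hmp 'migrate_set_downtime 0.5'
$ virsh qemu-monitor-command demo --hmp 'migrate_set_speed 1G'
# opt-in capabilities, set before starting the migration
$ virsh qemu-monitor-command demo --hmp 'migrate_set_capability auto-converge on'
$ virsh qemu-monitor-command demo --hmp 'migrate_set_capability postcopy-ram on'
# compression: multi-thread, or XBZRLE with a page cache
$ virsh qemu-monitor-command demo --hmp 'migrate_set_capability compress on'
$ virsh qemu-monitor-command demo --hmp 'migrate_set_capability xbzrle on'
$ virsh qemu-monitor-command demo --hmp 'migrate_set_cache_size 1G'
# start the migration, then optionally flip to post-copy mode mid-flight
$ virsh qemu-monitor-command demo --hmp 'migrate -d tcp:myotherhost:4444'
$ virsh qemu-monitor-command demo --hmp 'migrate_start_postcopy'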

Measuring impact of the techniques

Understanding what the various techniques do in order to maximise chances of a successful migration is useful, but it is hard to predict how well they will perform in the real world when faced with varying workloads. In particular, are they actually capable of ensuring completion under worst case workloads, and what level of performance impact do they actually have on the guest workload? This is a problem that the OpenStack Nova project is currently struggling to get a clear answer on, with a view to improving Nova’s management of libvirt migration. In order to try and provide some guidance in this area, I’ve spent a couple of weeks working on a framework for benchmarking QEMU guest performance when subjected to the various different migration techniques outlined above.

In OpenStack the goal is for migration to be a totally “hands off” operation for cloud administrators. They should be able to request a migration and then forget about it until it completes, without having to babysit it to apply tuning parameters. The other goal is that the Nova API should not have to expose any hypervisor specific concepts such as post-copy, auto-converge, compression, etc. Essentially Nova itself has to decide which QEMU migration features to use and just “do the right thing” to ensure completion. Whatever approach is chosen needs to be able to cope with any type of guest workload, since the cloud admins will not have any visibility into what applications are actually running inside the guest. With this in mind, when it came to performance testing the QEMU migration features, it was decided to look at their behaviour when faced with the worst case scenario. Thus a stress program was written which would allocate many GB of RAM, and then spawn a thread on each vCPU that would loop forever xor’ing every byte of RAM against an array of bytes taken from /dev/random. This ensures that the guest is heavy on both reads and writes to memory, as well as creating RAM pages which are very hostile towards compression. This stress program was statically linked and built into a ramdisk as the /init program, so that Linux would boot and immediately run this stress workload in a fraction of a second. In order to measure performance of the guest, each time 1 GB of RAM has been touched, the program will print out details of how long it took to update this GB and an absolute timestamp. These records are captured over the serial console from the guest, to be later correlated with what is taking place on the host side wrt migration.
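
As a hedged illustration of that packaging step (the file names here are placeholders rather than the exact harness sources), the stress program can be statically linked, wrapped into an initrd as /init and booted directly, with its timing records arriving on the serial console:

# statically link the workload and pack it as the initrd's /init
$ gcc -static -O2 -o init stress.c -lpthread
$ echo init | cpio -o -H newc | gzip > stress-initrd.gz
# boot a guest straight into the workload, capturing output over serial
$ qemu-system-x86_64 -m 8G -smp 4 \
      -kernel /boot/vmlinuz-$(uname -r) -initrd stress-initrd.gz \
      -append 'console=ttyS0' -serial stdio -display none \
      ...other QEMU options...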

Next up it was time to create a tool to control QEMU from the host and manage the migration process, activating the desired features. A test scenario was defined which encodes details of what migration features are under test and their settings (number of iterations before activating post-copy, bandwidth limits, max downtime values, number of compression threads, etc). A hardware configuration was also defined which expressed the hardware characteristics of the virtual machine running the test (number of vCPUs, size of RAM, host NUMA memory & CPU binding, usage of huge pages, memory locking, etc). The tests/migration/guestperf.py tool provides the mechanism to invoke the test in any of the possible configurations. For example, to test post-copy migration, switching to post-copy after 3 iterations, allowing 1Gbs bandwidth on a guest with 4 vCPUs and 8 GB of RAM, one might run:

$ tests/migration/guestperf.py --cpus 4 --mem 8 --post-copy --post-copy-iters 3 --bandwidth 125 --dst-host myotherhost --transport tcp --output postcopy.json

The postcopy.json file contains the full report of the test results. This includes all details of the test scenario and hardware configuration, migration status recorded at the start of each iteration over RAM, the host CPU usage recorded once a second, and the guest stress test output. The accompanying tests/migration/guestperf-plot.py tool can consume this data file and produce interactive HTML charts illustrating the results.

$ tests/migration/guestperf-plot.py --split-guest-cpu --qemu-cpu --vcpu-cpu --migration-iters --output postcopy.html postcopy.json

To assist in making comparisons between runs, a set of standardized test scenarios is also defined, which can be run via the tests/migration/guestperf-batch.py tool; in this case it is merely required to provide the desired hardware configuration:

$ tests/migration/guestperf-batch.py --cpus 4 --mem 8 --dst-host myotherhost --transport tcp --output myotherhost-4cpu-8gb

This will run all the standard defined test scenarios and save many data files in the myotherhost-4cpu-8gb directory. The same guestperf-plot.py tool can be used to create charts combining multiple data sets at once to allow easy comparison.
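
For instance, to overlay several of the saved data sets from that directory in a single chart (the JSON file names are whatever guestperf-batch.py chose to write, so the glob below is just a placeholder):

$ tests/migration/guestperf-plot.py --split-guest-cpu --qemu-cpu --vcpu-cpu \
      --migration-iters --output compare.html \
      myotherhost-4cpu-8gb/*.json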

Performance results for QEMU 2.6

With the tools written, I went about running some tests against the QEMU GIT master codebase, which was effectively the same as the QEMU 2.6 code just released. The pair of hosts used were Dell PowerEdge R420 servers with 8 CPUs and 24 GB of RAM, spread across 2 NUMA nodes. The primary NICs were Broadcom Gigabit, but the hosts were augmented with Mellanox 10-Gig-E RDMA capable NICs, which were picked for transfer of the migration traffic. For the tests I decided to collect data for two distinct hardware configurations, a small uniprocessor guest (1 vCPU and 1 GB of RAM) and a moderately sized multi-processor guest (4 vCPUs and 8 GB of RAM). Memory and CPU binding was specified such that the guests were confined to a single NUMA node to avoid performance measurements being skewed by cross-NUMA node memory accesses. The hosts and guests were all running the RHEL-7 3.10.0-0369.el7.x86_64 kernel.

To understand the impact of different network transports & their latency characteristics, the two hardware configurations were combinatorially expanded against 4 different network configurations – a local UNIX transport, a localhost TCP transport, a remote 10Gbs TCP transport and a remote 10Gbs RDMA transport.

The full set of results is linked from the tables that follow. The first link in each row gives a guest CPU performance comparison for the scenarios in that row; the other entries in the row give the full host & guest performance details for that particular scenario.

UNIX socket, 1 vCPU, 1 GB RAM

Using UNIX socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM

  • Pause, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Pause, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Post-copy, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Post-copy, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Auto-converge, unlimited BW: 5% CPU step / 10% CPU step / 20% CPU step
  • Auto-converge, 10% CPU step: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • MT compression, unlimited BW: 1 thread / 2 threads / 4 threads
  • XBZRLE compression, unlimited BW: 5% cache / 10% cache / 20% cache / 50% cache

UNIX socket, 4 vCPU, 8 GB RAM

Using UNIX socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM

  • Pause, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Pause, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Post-copy, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Post-copy, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Auto-converge, unlimited BW: 5% CPU step / 10% CPU step / 20% CPU step
  • Auto-converge, 10% CPU step: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • MT compression, unlimited BW: 1 thread / 2 threads / 4 threads
  • XBZRLE compression, unlimited BW: 5% cache / 10% cache / 20% cache / 50% cache

TCP socket local, 1 vCPU, 1 GB RAM

Using TCP socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM

  • Pause, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Pause, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Post-copy, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Post-copy, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Auto-converge, unlimited BW: 5% CPU step / 10% CPU step / 20% CPU step
  • Auto-converge, 10% CPU step: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • MT compression, unlimited BW: 1 thread / 2 threads / 4 threads
  • XBZRLE compression, unlimited BW: 5% cache / 10% cache / 20% cache / 50% cache

TCP socket local, 4 vCPU, 8 GB RAM

Using TCP socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM

  • Pause, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Pause, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Post-copy, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Post-copy, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Auto-converge, unlimited BW: 5% CPU step / 10% CPU step / 20% CPU step
  • Auto-converge, 10% CPU step: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • MT compression, unlimited BW: 1 thread / 2 threads / 4 threads
  • XBZRLE compression, unlimited BW: 5% cache / 10% cache / 20% cache / 50% cache

TCP socket remote, 1 vCPU, 1 GB RAM

Using TCP socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM

  • Pause, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Pause, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Post-copy, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Post-copy, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Auto-converge, unlimited BW: 5% CPU step / 10% CPU step / 20% CPU step
  • Auto-converge, 10% CPU step: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • MT compression, unlimited BW: 1 thread / 2 threads / 4 threads
  • XBZRLE compression, unlimited BW: 5% cache / 10% cache / 20% cache / 50% cache

TCP socket remote, 4 vCPU, 8 GB RAM

Using TCP socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM

  • Pause, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Pause, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Post-copy, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Post-copy, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Auto-converge, unlimited BW: 5% CPU step / 10% CPU step / 20% CPU step
  • Auto-converge, 10% CPU step: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • MT compression, unlimited BW: 1 thread / 2 threads / 4 threads
  • XBZRLE compression, unlimited BW: 5% cache / 10% cache / 20% cache / 50% cache

RDMA socket, 1 vCPU, 1 GB RAM

Using RDMA socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM

  • Pause, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Pause, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Post-copy, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Post-copy, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Auto-converge, unlimited BW: 5% CPU step / 10% CPU step / 20% CPU step
  • Auto-converge, 10% CPU step: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • MT compression, unlimited BW: 1 thread / 2 threads / 4 threads
  • XBZRLE compression, unlimited BW: 5% cache / 10% cache / 20% cache / 50% cache

RDMA socket, 4 vCPU, 8 GB RAM

Using RDMA socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM

  • Pause, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Pause, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Post-copy, unlimited BW: 0 iters / 1 iter / 5 iters / 20 iters
  • Post-copy, 5 iters: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • Auto-converge, unlimited BW: 5% CPU step / 10% CPU step / 20% CPU step
  • Auto-converge, 10% CPU step: 100 mbs / 300 mbs / 1 gbs / 10 gbs / unlimited
  • MT compression, unlimited BW: 1 thread / 2 threads / 4 threads
  • XBZRLE compression, unlimited BW: 5% cache / 10% cache / 20% cache / 50% cache

Analysis of results

The charts linked above provide the full set of raw results, from which you are welcome to draw your own conclusions. The test harness is also posted on the qemu-devel mailing list and will hopefully be merged into GIT at some point, so anyone can repeat the tests or run tests to compare other scenarios. What follows now is my interpretation of the results and the interesting points they show:

  • There is a clear periodic pattern in guest performance that coincides with the start of each migration iteration. Specifically at the start of each iteration there is a notable and consistent momentary drop in guest CPU performance. Picking an example where this effect is clearly visible – the 1 vCPU, 1GB RAM config with the “Pause 5 iters, 300 mbs” test – we can see the guest CPU performance drop from 200ms/GB of data modified, to 450ms/GB. QEMU maintains a bitmap associated with guest RAM to track which pages are dirtied by the guest while migration is running. At the start of each iteration over RAM, this bitmap has to be read and reset and this action is what is responsible for this momentary drop in performance.
  • With the larger guest sizes, there is a second roughly periodic but slightly more chaotic pattern in guest performance that is continual throughout migration. The magnitude of these spikes is about 1/2 that of those occurring at the start of each iteration. An example where this effect is clearly visible is the 4 vCPU, 8GB RAM config with the “Pause unlimited BW, 20 iters” test – we can see the guest CPU performance is dropping from 500ms/GB to between 700ms/GB and 800ms/GB. The host NUMA node that the guest is confined to has 4 CPUs and the guest itself has 4 CPUs. When migration is running, QEMU has a dedicated thread performing the migration data I/O and this is sharing time on the 4 host CPUs with the guest CPUs. So with QEMU emulator threads sharing the same pCPUs as the vCPU threads, we have 5 workloads competing for 4 CPUs. IOW the frequent, slightly chaotic spikes in guest performance throughout the migration iteration are a result of overcommitting the host pCPUs. The magnitude of the spikes is directly proportional to the total transfer bandwidth permitted for the migration. This is not an inherent problem with migration – it would be possible to place QEMU emulator threads on a separate pCPU from vCPU threads if strong isolation is desired between the guest workload and migration processing.
  • The baseline guest CPU performance differs between the 1 vCPU, 1 GB RAM and 4 vCPU, 8 GB RAM guests. Comparing the UNIX socket “Pause unlimited BW, 20 iters” test results for these 1 vCPU and 4 vCPU configs we see the former has a baseline performance of 200ms/GB of data modified while the latter has 400ms/GB of data modified. This is clearly nothing to do with migration at all. Naively one might think that going from 1 vCPU to 4 vCPUs would result in 4 times the performance, since we have 4 times more threads available to do work. What we’re seeing here is likely the result of hitting the memory bandwidth limit, so each vCPU is competing for memory bandwidth and thus the overall performance of each vCPU has decreased. So instead of getting 4x the performance, going from 1 to 4 vCPUs only doubled the performance.
  • When post-copy is operating in its pre-copy phase, it has no measurable impact on the guest performance compared to when post-copy is not enabled at all. This can be seen by comparing the TCP socket “Paused 5 iters, 1 Gbs” test results with the “Post-copy 5 iters, 1 Gbs” test results. Both show the same baseline guest CPU performance and the same magnitude of spikes at the start of each iteration. This shows that it is viable to unconditionally enable the post-copy feature for all migration operations, even if the migration is likely to complete without needing to switch from pre-copy to post-copy phases. It provides the admin/app the flexibility to dynamically decide on the fly whether to switch to post-copy mode or stay in pre-copy mode until completion.
  • When post-copy migration switches from its pre-copy phase to the post-copy phase, there is a major but short-lived spike in guest CPU performance. What is happening here is that the guest has perhaps 80% of its RAM transferred to the target host when post-copy phase starts but the guest workload is touching some pages which are still on the source, so the page fault is having to wait for the page to be transferred across the network. The magnitude of the spike and duration of the post-copy phase is related to the total guest RAM size and bandwidth available. Taking the remote TCP case with 1 vCPU, 1 GB RAM hardware config for clarity, and comparing the “Post-copy 5 iters, 1Gbs” scenario with the “Post-copy 5 iters, 10Gbs” scenario, we can see the magnitude of the spike in guest performance is the same order of magnitude in both cases. The overall time for each iteration of pre-copy phase is clearly shorter in the 10Gbs case. If we further compare with the local UNIX domain socket, we can see the spike in performance is much lower at the post-copy phase. What this is telling us is that the magnitude of the spike in the post-copy phase is largely driven by the latency in the time to transfer an out of band requested page from the source to the target, rather than the overall bandwidth available. There are plans in QEMU to allow migration to use multiple TCP connections which should significantly reduce the post-copy latency spike as the out-of-band requested pages will not get stalled behind a long TCP transmit queue for the background bulk-copy.
  • Auto-converge will often struggle to ensure convergence for larger guest sizes or when the bandwidth is limited. Considering the 4 vCPU, 8 GB RAM remote TCP test comparing effects of different bandwidth limits, we can see that with a 10Gbs bandwidth cap, auto-converge had to throttle to 80% to allow completion, while other tests show as much as 95% or even 99% in some cases. With a lower bandwidth limit of 1Gbs, the test case timed out after 5 minutes of running, having only throttled down by 20%, showing auto-converge is not nearly aggressive enough when faced with low bandwidth links. The worst case guest performance seen when running auto-converge with CPUs throttled to 80% was on a par with that seen with post-copy immediately after switching to the post-copy phase. The difference is that auto-converge shows that worst-case hit for a very long time during pre-copy, potentially many minutes, whereas post-copy only showed it for a few seconds.
  • Multi-thread compression was actively harmful to the chances of a successful migration. Considering the 4 vCPU, 8 GB RAM remote TCP test comparing thread counts, we can see that increasing the number of threads actually made performance worse, with fewer iterations over RAM being completed before the 5 minute timeout was hit. The longer each iteration takes the more time the guest has to dirty RAM, so the less likely migration is to complete. There are two factors believed to be at work here that make the MT compression results so bad. First, as noted earlier QEMU is confined to 4 pCPUs, so with 4 vCPUs running, the compression threads have to compete for time with the vCPU threads, slowing down the speed of compression. Second, the stress test workload run in the guest is writing completely random bytes, which are a pathological input dataset for compression, allowing almost no compression. Given the fact the compression was CPU limited though, even if there had been a good compression ratio, it would be unlikely to have a significant benefit since the increased time to iterate over RAM would allow the guest to dirty more data, eliminating the advantage of compressing it. If the QEMU emulator threads were given dedicated host pCPUs to run on it may have increased the performance somewhat, but then that assumes the host has CPUs free that are not running other guests.
  • XBZRLE compression fared a little better than MT compression. Again considering the 4 vCPU, 8 GB RAM remote TCP test comparing RAM cache sizing, we can see that the time required for each iteration over RAM did not noticeably increase. This shows that while XBZRLE compression did have a notable impact on guest CPU performance, it did not hit a major bottleneck in processing each page as compared to MT compression. Again though, it did not help to achieve migration completion, with all tests timing out after 5 minutes or 30 iterations over RAM. This is due to the fact that the guest stress workload is again delivering input data that hits the pathological worst case in the algorithm. Faced with such a workload, no matter how much CPU time or RAM cache is available, XBZRLE can never have any positive impact on migration.
  • The RDMA data transport showed up a few of its quirks. First, by looking at the RDMA results comparing pause bandwidth, we can clearly identify a bug in QEMU’s RDMA implementation – it is not honouring the requested bandwidth limits – it always transfers at maximum link speed. Second, all the post-copy results show failure, confirming that post-copy is currently not compatible with RDMA migration. When comparing 10Gbs RDMA against 10Gbs TCP transports, there is no obvious benefit to using RDMA – it was not any more likely to complete migration in any of the test scenarios.

Considering all the different features tested, post-copy is the clear winner. It was able to guarantee completion of migration every single time, regardless of guest RAM size, with minimal long lasting impact on guest performance. While it did have a notable spike impacting guest performance at the time of switching from the pre-copy to post-copy phases, this impact was short lived, only a few seconds. The next best result was seen with auto-converge, which again managed to complete migration in the majority of cases. By comparison with post-copy, the worst case impact seen on the guest CPU performance was the same order of magnitude, but it lasted for a very long time, many minutes. In addition, in more bandwidth limited scenarios, auto-converge was unable to throttle guest CPUs quickly enough to avoid hitting the overall 5 minute timeout, whereas post-copy would always succeed except in the most limited bandwidth scenarios (100Mbs – where no strategy can ever work). The other benefit of post-copy is that only the guest OS thread responsible for the page fault is delayed – other threads in the guest OS will continue running at normal speed if their RAM is already on the target host. With auto-converge, all guest CPUs and threads are throttled regardless of whether they are responsible for dirtying memory. IOW post-copy has a targeted performance hit, whereas auto-converge is indiscriminate. Finally, as noted earlier, post-copy does have a failure scenario which can result in losing the VM if the network to the source host is lost for long enough to timeout the TCP connection while in post-copy mode. This risk can be mitigated with redundancy at the network layer, and the VM is only at risk for the short period of time the guest is running in post-copy mode, which is mere seconds with a 10Gbs link.

It was expected that the compression features would fare badly given the guest workload, but the impact was far worse than expected, particularly for MT compression. Given the major requirement compression has in terms of host CPU time (MT compression) or host RAM (XBZRLE compression), they do not appear to be viable as general purpose features. They should only be used if the workloads are known to be compression friendly, the host has the CPU and/or RAM resources to spare, and neither post-copy nor auto-converge is possible to use. To make these features more practical to use in an automated general purpose manner, QEMU would have to be enhanced to allow the mgmt application to have direct control over turning them on and off during migration. This would allow the app to try using compression, monitor its effectiveness and then turn compression off if it is being harmful, rather than having to abort the migration entirely and restart it.

There is scope for further testing with RDMA, since the hardware used for testing was limited to 10Gbs. Newer RDMA hardware is supposed to be capable of reaching higher speeds, 40Gbs, even 100Gbs, which would have a correspondingly positive impact on the ability to migrate. At least for any speeds of 10Gbs or less though, it does not appear worthwhile to use RDMA; apps would be better off using TCP in combination with post-copy.

In terms of network I/O, no matter what the guest workload, QEMU is generally capable of saturating whatever link is used for migration for as long as it takes to complete. It is very easy to create workloads that will never complete, and decreasing the bandwidth available just increases the chances that migration will never complete. It might be tempting to think that if you have 2 guests, it would take the same total time whether you migrate them one after the other, or migrate them in parallel. This is not necessarily the case though, as with a parallel migration the bandwidth will be shared between them, which increases the chances that neither guest will ever be able to complete. So as a general rule it appears wise to serialize all migration operations on a given host, unless there are multiple NICs available.

In summary, use post-copy if it is available, otherwise use auto-converge. Don’t bother with compression unless the workload is known to be very compression friendly. Don’t bother with RDMA unless it supports more than 10 Gbs, otherwise stick with plain TCP.

Improving QEMU security part 5: TLS support for NBD server & client

Posted: April 5th, 2016 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Security, Virt Tools

This blog is part 5 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

For many years now QEMU has had code to support the NBD protocol, either as a client or as a server. The qemu-nbd command line tool can be used to export a disk image over NBD to a remote machine, or connect it directly to the local kernel’s NBD block device driver. The QEMU system emulators also have a block driver that acts as an NBD client, allowing VMs to be run from NBD volumes. More recently the QEMU system emulators gained the ability to export the disks from a running VM as named NBD volumes. The latter is particularly interesting because it is the foundation of live migration with block device replication, allowing VMs to be migrated even if you don’t have shared storage between the two hosts. In common with most network block device protocols, NBD has never offered any kind of data security capability. Administrators are recommended to run NBD over a private LAN/vLAN, use network layer security like IPSec, or tunnel it over some other kind of secure channel. While all these options are capable of working, none are very convenient to use because they require extra setup steps outside of the basic operation of the NBD server/clients. Libvirt has long had the ability to tunnel the QEMU migration channel over its own secure connection to the target host, but this has not been extended to cover the NBD channel(s) opened when doing block migration. While it could theoretically be extended to cover NBD, it would not be ideal from a performance POV because the libvirtd architecture means that the TLS encryption/decryption for multiple separate network connections would be handled by a single thread. For fast networks (10-GigE), libvirt will quickly become the bottleneck on performance even if the CPU has native support for AES.

Thus it was decided that the QEMU NBD client & server would need to be extended to support TLS encryption of the data channel natively. Initially the thought was to just add a flag to the client/server code to indicate that TLS was desired and run the TLS handshake before even starting the NBD protocol. After some discussion with the NBD maintainers though, it was decided to explicitly define a way to support TLS in the NBD protocol negotiation phase. The primary benefit of doing this is to allow clearer error reporting to the user if the client connects to a server requiring use of TLS and the client itself does not support TLS, or vice-versa – i.e. instead of just seeing what appears to be a mangled NBD handshake and not knowing what it means, the client can clearly report “This NBD server requires use of TLS encryption”.

The extension to the NBD protocol was fairly straightforward. After the initial NBD greeting (where the client & server agree the NBD protocol variant to be used) the client is able to request a number of protocol options. A new option was defined to allow the client to request TLS support. If the server agrees to use TLS, then they perform a standard TLS handshake and the rest of the NBD protocol carries on as normal. To prevent downgrade attacks, if the NBD server requires TLS and the client does not request the TLS option, then it will respond with an error and drop the client. In addition if the server requires TLS, then TLS must be the first option that the client requests – other options are only permitted once the TLS session is active & the server will again drop the client if it tries to request non-TLS options first.

The QEMU NBD implementation was originally using plain POSIX sockets APIs for all its I/O. So the first step in enabling TLS was to update the NBD code so that it used the new general purpose QEMU I/O channel  APIs instead. With that done it was simply a matter of instantiating a new QIOChannelTLS object at the correct part of the protocol handshake and adding various command line options to the QEMU system emulator and qemu-nbd program to allow the user to turn on TLS and configure x509 certificates.

Running an NBD server using TLS can be done as follows:

$ qemu-nbd --object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
           --tls-creds tls0 /path/to/disk/image.qcow2

On the client host, a QEMU guest can then be launched, connecting to this NBD server:

$ qemu-system-x86_64 -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/qemutls \
                     -drive driver=nbd,host=theotherhost,port=10809,tls-creds=tls0 \
                     ...other QEMU options...

Finally to enable support for live migration with block device replication, the QEMU system monitor APIs gained support for a new parameter when starting the internal NBD server. All of this code was merged in time for the forthcoming QEMU 2.6 release. Work has not yet started to enable TLS with NBD in libvirt, as there is little point securing the NBD protocol streams, until the primary live migration stream is using TLS. More on live migration in a future blog post, as that’s going to be QEMU 2.7 material now.
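
As a rough sketch of that monitor usage (assuming a tls-creds-x509 object with id=tls0 already exists in the QEMU process, and with the domain name, listen address and device name as placeholders to be checked against the QEMU 2.6 QMP schema), the internal NBD server can be started with TLS and a drive then exported:

$ virsh qemu-monitor-command demo \
      '{"execute": "nbd-server-start",
        "arguments": {"addr": {"type": "inet",
                               "data": {"host": "0.0.0.0", "port": "10809"}},
                      "tls-creds": "tls0"}}'
$ virsh qemu-monitor-command demo \
      '{"execute": "nbd-server-add",
        "arguments": {"device": "drive-virtio-disk0", "writable": true}}'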


Improving QEMU security part 4: generic I/O channel framework to simplify TLS

Posted: April 4th, 2016 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Security, Virt Tools

This blog is part 4 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

Part 2 of this series described the creation of a general purpose API for simplifying TLS session handling inside QEMU, particularly with a view to hiding the complexity of the handshake and x509 certificate validation. The VNC server was converted to use this API, which was a big benefit, but there was still a need to add extra code to support TLS in the I/O paths. Specifically, anywhere that the VNC server would read/write on the network socket had to be made TLS aware, so that it would use the plain POSIX send/recv functions vs the TLS wrapped send/recv functions as appropriate. For the VNC server it is actually even more complex, because it also supports websockets, so each I/O point had to choose between plain, TLS, websockets and websockets plus TLS. As TLS support extends to other areas of QEMU this pattern would continue to complicate I/O paths in each backend.

Clearly there was a need for some form of I/O channel abstraction that would allow TLS to be enabled in each QEMU network backend without having to add conditional logic at every I/O send/recv call. Looking around at the QEMU subsystems that would ultimately need TLS support showed a variety of approaches currently in use:

  • Character devices use a combination of POSIX sockets APIs to establish connections and GIOChannel for performing I/O on them
  • Migration has a QEMUFile abstraction which provides read/write facilities for a number of underlying transports: TCP sockets, UNIX sockets, STDIO, external command, in-memory buffer and RDMA. The various QEMUFile impls all use the plain POSIX sockets APIs and, for TCP/UNIX sockets, the sendmsg/recvmsg functions for I/O
  • NBD client & server use plain POSIX sockets APIs and sendmsg/recvmsg for I/O
  • VNC server uses plain POSIX sockets APIs and sendmsg/recvmsg for I/O

The GIOChannel APIs used by the character device backend theoretically provide an extensible framework for I/O and there is even a TLS implementation of the GIOChannel API. The two limitations of GIOChannel for QEMU though are that it does not support scatter / gather / vectored I/O APIs and that it does not support file descriptor passing over UNIX sockets. The latter is not a show stopper, since you can still access the socket handle directly to send/recv file descriptors. The lack of vectored I/O though would be a significant issue for migration and NBD servers where performance is very important. While we could potentially extend GIOChannel to add support for new callbacks to do vectored I/O, by the time you’ve done that most of the original GIOChannel code isn’t going to be used, limiting the benefit of starting from GIOChannel as a base. It is also clear that GIOChannel is really not something that is going to get any further development from the GLib maintainers, since their focus is on the new and much better GIO library. This supports file descriptor passing and TLS encryption, but again lacks support for vectored I/O. The bigger show stopper though is that getting access to the TLS support requires depending on a version of GLib that is much newer than what QEMU is willing to use. The existing QEMUFile APIs could form the basis of a general purpose I/O channel system if they were untangled & extracted from the migration codebase. One limitation is that QEMUFile only concerns itself with I/O, not the initial channel establishment, which is left to the migration core code to deal with, so it did not actually provide very much of a foundation on which to build.

After looking through the various approaches in use in QEMU, and potentially available from GLib, it was decided that QEMU would be best served by creating a new general purpose I/O channel API. Thus a new QEMU subsystem was added in the io/ and include/io/ directories to provide a set of classes for I/O over a variety of different data channels. The core design aims were to use the QEMU object model (QOM) framework to provide a standard pattern for extending / subclassing, to use the QEMU Error object for all error reporting, and to support file descriptor passing, main loop watch integration and coroutine integration. Overall the new design took many elements of its design from GIOChannel and the GIO library, and blended them with QEMU’s own codebase design. The initial goal was to provide enough functionality to convert the VNC server as a proof of concept. To this end the following classes were created:

  • QIOChannel – the abstract base defining the overall interface for the I/O framework
  • QIOChannelSocket – implementation targeting TCP, UDP and UNIX sockets
  • QIOChannelTLS – layer that can provide a TLS session over any other channel
  • QIOChannelWebsock – layer that can run the websockets protocol over any other channel

To avoid making this blog posting even larger, I won’t go into details of these (the code is available in QEMU git for anyone who’s really interested), but instead illustrate it with a comparison of the VNC code before & after. First consider the original code in the VNC server for dealing with writing a buffer of data over a plain socket or websocket, with or without TLS enabled. The following functions existed in the VNC server code to handle all the combinations:

ssize_t vnc_tls_push(const char *buf, size_t len, void *opaque)
{
    VncState *vs = opaque;
    ssize_t ret;

 retry:
    ret = send(vs->csock, buf, len, 0);
    if (ret < 0) {
        if (errno == EINTR) {
            goto retry;
        }
        return -1;
    }
    return ret;
}

ssize_t vnc_client_write_buf(VncState *vs, const uint8_t *data, size_t datalen)
{
    ssize_t ret;
    int err = 0;
    if (vs->tls) {
        ret = qcrypto_tls_session_write(vs->tls, (const char *)data, datalen);
        if (ret < 0) {
            err = errno;
        }
    } else {
        ret = send(vs->csock, (const void *)data, datalen, 0);
        if (ret < 0) {
            err = socket_error();
        }
    }
    return vnc_client_io_error(vs, ret, err);
}

long vnc_client_write_ws(VncState *vs)
{
    long ret;
    vncws_encode_frame(&vs->ws_output, vs->output.buffer, vs->output.offset);
    buffer_reset(&vs->output);
    return vnc_client_write_buf(vs, vs->ws_output.buffer, vs->ws_output.offset);
}

static void vnc_client_write_locked(void *opaque)
{
    VncState *vs = opaque;

    if (vs->encode_ws) {
        vnc_client_write_ws(vs);
    } else {
        vnc_client_write_plain(vs);
    }
}

After conversion to use the new QIOChannel classes for sockets, websockets and TLS, all of the VNC server code above turned into

ssize_t vnc_client_write_buf(VncState *vs, const uint8_t *data, size_t datalen)
{
    Error *err = NULL;
    ssize_t ret;
    ret = qio_channel_write(vs->ioc, (const char *)data, datalen, &err);
    return vnc_client_io_error(vs, ret, &err);
}

It is clearly a major win for maintainability of the VNC server code to have all the TLS and websockets I/O support handled by the QIOChannel APIs. There is no impact to supporting TLS and websockets anywhere in the VNC server I/O paths now. The only place where there is new code is the point where the TLS or websockets session is initiated and this now only requires instantiation of a suitable QIOChannel subclass and registering a callback to be run when the session handshake completes (or fails).

tls = qio_channel_tls_new_server(vs->ioc, vs->vd->tlscreds, vs->vd->tlsaclname, &err);
if (!tls) {
    vnc_client_error(vs);
    return 0;
}

object_unref(OBJECT(vs->ioc));
vs->ioc = QIO_CHANNEL(tls);

qio_channel_tls_handshake(tls, vnc_tls_handshake_done, vs, NULL);

Notice that the code is simply replacing the current QIOChannel handle ‘vs->ioc’ with an instance of the QIOChannelTLS class. The vnc_tls_handshake_done method is invoked when the TLS handshake is complete or failed and lets the VNC server continue with the next part of its authentication protocol, or drop the client connection as appropriate. So adding TLS session support to the VNC server comes in at about 10 lines of code now.
