Live migration is a long standing feature in QEMU/KVM (and other competing virtualization platforms), however, by default it does not cope very well with guests whose workload are very memory write intensive. It is very easy to create a guest workload that will ensure a migration will never complete in its default configuration. For example, a guest which continually writes to each byte in a 1 GB region of RAM will never successfully migrate over a 1Gb/sec NIC. Even with a 10Gb/s NIC, a slightly larger guest can dirty memory fast enough to prevent completion without an unacceptably large downtime at switchover. Thus over the years, a number of optional features have been developed for QEMU with the aim to helping migration to complete.
If you don’t want to read the background information on migration features and the testing harness, skip right to the end where there are a set of data tables showing charts of the results, followed by analysis of what this all means.
The techniques available
- Downtime tuning. Unless the guest is completely idle, it never possible to get to a point where 100% of memory has been transferred to the target host. So at some point there needs to be a decision made about whether enough memory has been transferred to allow the switch over to the target host with acceptable blackout period. The downtime tunable controls how long a blackout period is permitted during the switchover. QEMU measures the network transfer rate it is achieving and compares it to the amount of outstanding RAM to determine if it can be transferred within the configured downtime window. When migrating it is not desirable to set QEMU to use the maximum accepted downtime straightaway, as that guarantees that the guest will always suffer from the maximum downtime blackout. Instead, it is better to start off with a fairly small downtime value and increase the permitted downtime as time passes. The idea is to maximise the likelihood that migration can complete with a small downtime.
- Bandwidth tuning. If the migration is taking place over a NIC that is used for other non-migration related actions, it may be desirable to prevent the migration stream from consuming all bandwidth. As noted earlier though, even a relatively small guest is capable of dirtying RAM fast enough that even a 10Gbs NIC will not be able to complete migration. Thus if the goal is to maximise the chances of getting a successful migration though, the aim should be to maximise the network bandwidth available to the migration operation. Following on from this, it is wise not to try to run multiple migration operations in parallel unless their transfer rates show that they are not maxing out the available bandwidth, as running parallel migrations may well mean neither will ever finish.
- Pausing CPUs. The simplest and crudest mechanism for ensuring guest migration complete is to simply pause the guest CPUs. This prevent the guest from continuing to dirty memory and thus even on the slowest network, it will ensure migration completes in a finite amount of time. The cost is that the guest workload will be completely stopped for a prolonged period of time. Think of pausing the guest as being equivalent to setting an arbitrarily long maximum permitted downtime. For example, assuming a guest with 8 GB of RAM and an idle 10Gbs NIC, in the worst case pausing would lead to to approx 6 second period of downtime. If higher speed NICs are available, the impact of pausing will decrease until it converges with a typical max downtime setting.
- Auto-convergence. The rate at which a guest can dirty memory is related to the amount of time the guest CPUs are permitted to run for. Thus by throttling the CPU execution time it is possible to prevent the guest from dirtying memory so quickly and thus allow migration data transfer to keep ahead of RAM dirtying. If this feature is enabled, by default QEMU starts by cutting 20% of the guest vCPU execution time. At the startof each iteration over RAM, it will check progress during the previous two iterations. If insufficient forward progress is being made, it will repeatedly cut 10% off the running time allowed to vCPUs. QEMU will throttle CPUs all the way to 99%. This should guarantee that migration can complete on all by the most sluggish networks, but has a pretty high cost to guest CPU performance. It is also indiscriminate in that all guest vCPUs are throttled by the same factor, even if only one guest process is responsible for the memory dirtying.
- Post-copy. Normally migration will only switch over to running on the target host once all RAM has been transferred. With post-copy, the goal is to transfer “enough” or “most” RAM across and then switch over to running on the target. When the target QEMU gets a fault for a memory page that has not yet been transferred, it’ll make an explicit out of band request for that page from the source QEMU. Since it is possible to switch to post-copy mode at any time, it avoids the entire problem of having to complete migration in a fixed downtime window. The cost is that while running in post-copy mode, guest page faults can be quite expensive, since there is a need to wait for the source host to transfer the memory page over to the target, which impacts performance of the guest during post-copy phase. If there is a network interruption while in post-copy mode it will also be impossible to recover. Since neither the source or target host has a complete view of the guest RAM it will be necessary to reboot the guest.
- Compression. The migration pages are usually transferred to the target host as-is. For many guest workloads, memory page contents will be fairly easily compressible. So if there are available CPU cycles on the source host and the network bandwidth is a limiting factor, it may be worth while burning source CPUs in order to compress data transferred over the network. Depending on the level of compression achieved it may allow migration to complete. If the memory is not compression friendly though, it would be burning CPU cycles for no benefit. QEMU supports two compression methods, XBZRLE and multi-thread, either of which can be enabled. With XBZRLE a cache of previously sent memory pages is maintained that is sized to be some percentage of guest RAM. When a page is dirtied by the guest, QEMU compares the new page contents to that in the cache and then only sends a delta of the changes rather than the entire page. For this to be effective the cache size must generally be quite large – 50% of guest RAM would not be unreasonable. The alternative compression approach uses multiple threads which simply use zlib to directly compress the full RAM pages. This avoids the need to maintain a large cache of previous RAM pages, but is much more CPU intensive unless hardware acceleration is available for the zlib compression algorithm.
Measuring impact of the techniques
Understanding what the various techniques do in order to maximise chances of a successful migration is useful, but it is hard to predict how well they will perform in the real world when faced with varying workloads. In particular, are they actually capable of ensuring completion under worst case workloads and what level of performance impact do they actually have on the guest workload. This is a problem that the OpenStack Nova project is currently struggling to get a clear answer on, with a view to improving Nova’s management of libvirt migration. In order to try and provide some guidance in this area, I’ve spent a couple of weeks working on a framework for benchmarking QEMU guest performance when subjected to the various different migration techniques outlined above.
In OpenStack the goal is for migration to be a totally “hands off” operation for cloud administrators. They should be able to request a migration and then forget about it until it completes, without having to baby sit it to apply tuning parameters. The other goal is that the Nova API should not have to expose any hypervisor specific concepts such as post-copy, auto-converge, compression, etc. Essentially Nova itself has to decide which QEMU migration features to use and just “do the right thing” to ensure completion. Whatever approach is chosen needs to be able to cope with any type of guest workload, since the cloud admins will not have any visibility into what applications are actually running inside the guest. With this in mind, when it came to performance testing the QEMU migration features, it was decided to look at their behaviour when faced with the worst case scenario. Thus a stress program was written which would allocate many GB of RAM, and then spawn a thread on each vCPU that would loop forever xor’ing every byte of RAM against an array of bytes taken from /dev/random. This ensures that the guest is both heavy on reads and writes to memory, as well as creating RAM pages which are very hostile towards compression. This stress program was statically linked and built into a ramdisk as the /init program, so that Linux would boot and immediately run this stress workload in a fraction of a second. In order to measure performance of the guest, each time 1 GB of RAM has been touched, the program will print out details of how long it took to update this GB and an absolute timestamp. These records are captured over the serial console from the guest, to be later correlated with what is taking place on the host side wrt migration.
Next up it was time to create a tool to control QEMU from the host and manage the migration process, activating the desired features. A test scenario was defined which encodes details of what migration features are under test and their settings (number of iterations before activating post-copy, bandwidth limits, max downtime values, number of compression threads, etc). A hardware configuration was also defined which expressed the hardware characteristics of the virtual machine running the test (number of vCPUs, size of RAM, host NUMA memory & CPU binding, usage of huge pages, memory locking, etc). The tests/migration/guestperf.py tool provides the mechanism to invoke the test in any of the possible configurations.For example, to test post-copy migration, switching to post-copy after 3 iterations, allowing 1Gbs bandwidth on a guest with 4 vCPUs and 8 GB of RAM one might run
$ tests/migration/guestperf.py --cpus 4 --mem 8 --post-copy --post-copy-iters 3 --bandwidth 125 --dst-host myotherhost --transport tcp --output postcopy.json
The myotherhost.json file contains the full report of the test results. This includes all details of the test scenario and hardware configuration, migration status recorded at start of each iteration over RAM, the host CPU usage recorded once a second, and the guest stress test output. The accompanying tests/migration/guestperf-plot.py tool can consume this data file and produce interactive HTML charts illustrating the results.
$ tests/migration/guestperf-plot.py --split-guest-cpu --qemu-cpu --vcpu-cpu --migration-iters --output postcopy.html postcopy.json
To assist in making comparisons between runs, however, a set of standardized test scenarios also defined which can be run via a tests/migration/guestperf-batch.py tool, in which case it is merely required to provide the desired hardware configuration
$ tests/migration/guestperf-batch.py --cpus 4 --mem 8 --dst-host myotherhost --transport tcp --output myotherhost-4cpu-8gb
This will run all the standard defined test scenarios and save many data files in the myotherhost-4cpu-8gb directory. The same guestperf-plot.py tool can be used to create charts combining multiple data sets at once to allow easy comparison.
Performance results for QEMU 2.6
With the tools written, I went about running some tests against QEMU GIT master codebase, which was effectively the same as the QEMU 2.6 code just released. The pair of hosts used were Dell PowerEdge R420 servers with 8 CPUs and 24 GB of RAM, spread across 2 NUMA nodes. The primary NICs were Broadcom Gigabit, but it has been augmented with Mellanox 10-Gig-E RDMA capable NICs, which is what were picked for transfer of the migration traffic. For the tests I decided to collect data for two distinct hardware configurations, a small uniprocessor guest (1 vCPU and 1 GB of RAM) and a moderately sized multi-processor guest (4 vCPUs and 8 GB of RAM). Memory and CPU binding was specified such that the guests were confined to a single NUMA node to avoid performance measurements being skewed by cross-NUMA node memory accesses. The hosts and guests were all running the RHEL-7 3.10.0-0369.el7.x86_64 kernel.
To understand the impact of different network transports & their latency characteristics, the two hardware configurations were combinatorially expanded against 4 different network configurations – a local UNIX transport, a localhost TCP transport, a remote 10Gbs TCP transport and a remote 10Gbs RMDA transport.
The full set of results are linked from the tables that follow. The first link in each row gives a guest CPU performance comparison for each scenario in that row. The other cells in the row give the full host & guest performance details for that particular scenario
UNIX socket, 1 vCPU, 1 GB RAM
Using UNIX socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM
UNIX socket, 4 vCPU, 8 GB RAM
Using UNIX socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM
TCP socket local, 1 vCPU, 1 GB RAM
Using TCP socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM
TCP socket local, 4 vCPU, 8 GB RAM
Using TCP socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM
TCP socket remote, 1 vCPU, 1 GB RAM
Using TCP socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM
TCP socket remote, 4 vCPU, 8 GB RAM
Using TCP socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM
RDMA socket, 1 vCPU, 1 GB RAM
Using RDMA socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM
RDMA socket, 4 vCPU, 8 GB RAM
Using RDMA socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM
Analysis of results
The charts above provide the full set of raw results, from which you are welcome to draw your own conclusions. The test harness is also posted on the qemu-devel mailing list and will hopefully be merged into GIT at some point, so anyone can repeat the tests or run tests to compare other scenarios. What follows now is my interpretation of the results and interesting points they show
- There is a clear periodic pattern in guest performance that coincides with the start of each migration iteration. Specifically at the start of each iteration there is a notable and consistent momentary drop in guest CPU performance. Picking an example where this effect is clearly visible – the 1 vCPU, 1GB RAM config with the “Pause 5 iters, 300 mbs” test – we can see the guest CPU performance drop from 200ms/GB of data modified, to 450ms/GB. QEMU maintains a bitmap associated with guest RAM to track which pages are dirtied by the guest while migration is running. At the start of each iteration over RAM, this bitmap has to be read and reset and this action is what is responsible for this momentary drop in performance.
- With the larger guest sizes, there is a second roughly periodic but slightly more chaotic pattern in guest performance that is continual throughout migration. The magnitude of these spikes is about 1/2 that of those occurring at the start of each iteration. An example where this effect is clearly visible is the 4 vCPU, 8GB RAM config with the “Pause unlimited BW, 20 iters” test – we can see the guest CPU performance is dropping from 500ms/GB to between 700ms/GB and 800ms/GB. The host NUMA node that the guest is confined to has 4 CPUs and the guest itself has 4 CPUs. When migration is running, QEMU has a dedicated thread performing the migration data I/O and this is sharing time on the 4 host CPUs with the guest CPUs. So with QEMU emulator threads sharing the same pCPUs as the vCPU threads, we have 5 workloads competing for 4 CPUs. IOW the frequently slightly chaotic spikes in guest performance throughout the migration iteration are a result of overcommiting the host pCPUs. The magnitude of the spikes is directly proportional to the total transfer bandwidth permitted for the migration. This is not an inherent problem with migration – it would be possible to place QEMU emulator threads on a separate pCPU from vCPU threads if strong isolation is desired between the guest workload and migration processing.
- The baseline guest CPU performance differs between the 1 vCPU, 1 GB RAM and 4 vCPU 8 GB RAM guests. Comparing the UNIX socket “Pause unlimited BW, 20 iters” test results for these 1 vCPU and 4 vCPU configs we see the former has a baseline performance of 200ms/GB of data modified while the latter has 400ms/GB of data modified. This is clearly nothing to do with migration at all. Naively one might think that going from 1 vCPU to 4 vCPUs would result in 4 times the performance, since we have 4 times more threads available to do work. What we’re seeing here is likely the result of hitting the memory bandwidth limit, so each vCPU is competing for memory bandwidth and thus the overall performance of each vCPU has decreased. So instead of getting x4 the performance going from 1 to 4 vCPUs only doubled the performance.
- When post-copy is operating in its pre-copy phase, it has no measurable impact on the gust performance compared to when post-copy is not enabled at all. This can be seen by comparing the TCP socket “Paused 5 iters, 1 Gbs” test results with the “Post-copy 5 iters, 1 Gbs” test results. Both show the same baseline guest CPU performance and the same magnitude of spikes at the start of each iteration. This shows that it is viable to unconditionally enable the post-copy feature for all migration operations, even if the migration is likely to complete without needing to switch from pre-copy to post-copy phases. It provides the admin/app the flexibility to dynamically decide on the fly whether to switch to post-copy mode or stay in pre-copy mode until completion.
- When post-copy migration switches from its pre-copy phase to the post-copy phase, there is a major but short-lived spike in guest CPU performance. What is happening here is that the guest has perhaps 80% of its RAM transferred to the target host when post-copy phase starts but the guest workload is touching some pages which are still on the source, so the page fault is having to wait for the page to be transferred across the network. The magnitude of the spike and duration of the post-copy phase is related to the total guest RAM size and bandwidth available. Taking the remote TCP case with 1 vCPU, 1 GB RAM hardware config for clarity, and comparing the “Post-copy 5 iters, 1Gbs” scenario with the “Post-copy 5 iters, 10Gbs” scenario, we can see the magnitude of the spike in guest performance is the same order of magnitude in both cases. The overall time for each iteration of pre-copy phase is clearly shorter in the 10Gbs case. If we further compare with the local UNIX domain socket, we can see the spike in performance is much lower at the post-copy phase. What this is telling us is that the magnitude of the spike in the post-copy phase is largely driven by the latency in the time to transfer an out of band requested page from the source to the target, rather than the overall bandwidth available. There are plans in QEMU to allow migration to use multiple TCP connections which should significantly reduce the post-copy latency spike as the out-of-band requested pages will not get stalled behind a long TCP transmit queue for the background bulk-copy.
- Auto-converge will often struggle to ensure convergence for larger guest sizes or when the bandwidth is limited. Considering the 4 vCPU, 8 GB RAM remote TCP test comparing effects of different bandwidth limits we can see that with a 10Gbs bandwidth cap, auto-converge had to throttle to 80% to allow completion, while other tests show as much as 95% or even 99% in some cases. With a lower bandwidth limit of 1Gbs, the test case timed out after 5 minutes of running, having only attempted throttled down by 20%, showing auto-converge is not nearly aggressive enough when faced with low bandwidth links. The worst case guest performance seen when running auto-converge with CPUs throttled to 80% was on a par with that seen with post-copy immediately after switching to post-copy phase. The difference is that auto-converge shows that worst-case hit for a very long time during pre-copy, potentially many minutes, where as post-copy only showed it for a few seconds.
- Multi-thread compression was actively harmful to chances of a successful migration. Considering the 4 vCPU, 8 GB RAM remote TCP test comparing thread counts, we can see that increasing the number of threads actually made performance worse, with less iterations over RAM being completed before the 5 minute timeout was hit. The longer each iteration takes the more time the guest has to dirty RAM, so the less likely migration is to complete. There are two factors believe to be at work here to make MT compression results so bad. First, as noted earlier QEMU is confined to 4 pCPUs, so with 4 vCPUs running, the compression threads have to compete for time with the vCPU threads slowing down speed of compression. The stress test workload run in the guest is writing completely random bytes which are a pathological input dataset for compression, allowing almost no compression. Given the fact the compression was CPU limited though, even if there had been a good compression ratio, it would be unlikely to have a significant benefit since the increased time to iterate over RAM would allow the guest to dirty more data eliminating the advantage of compressing it. If the QEMU emulator threads were given dedicated host pCPUs to run on it may have increased the performance somewhat, but then that assumes the host has CPUs free that are not running other guests.
- XBZRLE compression fared a little better than MT compression. Again considering the 4 vCPU, 8 GB RAM remote TCP test comparing RAM cache sizing, we can see that the time required for each iteration over RAM did not noticeably increase. This shows that while XBZRLE compression did have a notable impact on guest CPU performance, is not seeing a major bottleneck on processing of each page as compared to MT compression. Again though, it did not help to achieve migration completion, with all tests timing out after 5 minutes or 30 iterations over RAM. This is due to the fact that the guest stress workload is again delivering input data that hits the pathological worst case in the algorithm. Faced with such a workload, no matter how much CPU time or RAM cache is available, XBZRLE can never have any positive impact on migration.
- The RDMA data transport showed up a few of its quirks. First, by looking at the RDMA results comparing pause bandwidth, we can clearly identify a bug in QEMU’s RDMA implementation – it is not honouring the requested bandwidth limits – it always transfers at maximum link speed. Second, all the post-copy results show failure, confirming that post-copy is currently not compatible with RDMA migration. When comparing 10Gbs RDMA against 10Gbs TCP transports, there is no obvious benefit to using RDMA – it was not any more likely to complete migration in any of the test scenarios.
Considering all the different features tested, post-copy is the clear winner. It was able to guarantee completion of migration every single time, regardless of guest RAM size with minimal long lasting impact on guest performance. While it did have a notable spike impacting guest performance at time of switch from pre to post copy phases, this impact was short lived, only a few seconds. The next best result was seen with auto-converge which again managed to complete migration in the majority of cases. By comparison with post-copy, the worst case impact seen to the guest CPU performance was the same order of magnitude, but it lasted for a very very long time, many minutes long. In addition in more bandwidth limited scenarios, auto-converge was unable to throttle guest CPUs quickly enough to avoid hitting the overall 5 minute timeout, where as post-copy would always succeed except in the most limited bandwidth scenarios (100Mbs – where no strategy can ever work). The other benefit of post-copy is that only the guest OS thread responsible for the page fault is delayed – other threads in the guest OS will continue running at normal speed if their RAM is already on the host. With auto-converge, all guest CPUs and threads are throttled regardless of whether they are responsible for dirtying memory. IOW post-copy has a targetted performance hit, where as auto-converge is indiscriminate. Finally, as noted earlier, post-copy does have a failure scenario which can result in loosing the VM in post-copy mode if the network to the source host is lost for long enough to timeout the TCP connection. This risk can be mitigated with redundancy at the network layer and it is only at risk for the short period of time the guest is running in post-copy mode, which is mere seconds with 10Gbs link
It was expected that the compression features would fare badly given the guest workload, but the impact was far worse than expected, particularly for MT compression. Given the major requirement compression has in terms of host CPU time (MT compression) or host RAM (XBZRLE compression), they do no appear to be viable as a general purpose features. They should only be used if the workloads are known to be compression friendly, the host has the CPU and/or RAM resources to spare and neither post-copy or auto-converge are possible to use. To make these features more practical to use in an automated general purpose manner, QEMU would have to be enhanced to allow the mgmt application to have directly control over turning them on and off during migration. This would allow the app to try using compression, monitor its effectiveness and then turn compression off if it is being harmful, rather than having to abort the migration entirely and restart it.
There is scope for further testing with RDMA, since the hardware used for testing was limited to 10Gbs. Newer RDMA hardware is supposed to be capable of reaching higher speeds, 40Gbs, even 100 Gbs which would have a correspondingly positive impact on ability to migrate. At least for any speeds of 10Gbs or less though, it does not appear worthwhile to use RDMA, apps would be better off using TCP in combintaion with post-copy.
In terms of network I/O, no matter what guest workload, QEMU is generally capable of saturating whatever link is used for migration for as long as it takes to complete. It is very easy to create workloads that will never complete, and decreasing the bandwidth available just increases the chances of migration. It might be tempting to think that if you have 2 guests, it would take the same total time whether you migrate them one after the other, or migrate them in parallel. This is not necessarily the case though, as with a parallel migration the bandwidth will be shared between them, which increases the chances that neither guest will ever be able to complete. So as a general rule it appears wise to serialize all migration operations on a given host, unless there are multiple NICs available.
In summary, use post-copy if it is available, otherwise use auto-converge. Don’t bother with compression unless the workload is known to be very compression friendly. Don’t bother with RDMA unless it supports more than 10 Gbs, otherwise stick with plain TCP.