Guest CPU model configuration in libvirt with QEMU/KVM

Posted: February 15th, 2010 | Filed under: libvirt, Virt Tools

Many of the management problems in virtualization are caused by the annoyingly popular & desirable host migration feature! I previously talked about PCI device addressing problems, but this time the topic to consider is that of CPU models. Every hypervisor has its own policy for what a guest will see for its CPUs by default: Xen just passes through the host CPU, while with QEMU/KVM the guest sees a generic model called “qemu32” or “qemu64”. VMWare does something more advanced, classifying all physical CPUs into a handful of groups and has one baseline CPU model for each group that’s exposed to the guest. VMWare’s behaviour lets guests safely migrate between hosts provided they all have physical CPUs that classify into the same group. libvirt does not like to enforce policy itself, preferring just to provide the mechanism on which the higher layers define their own desired policy. CPU models are a complex subject, so it has taken longer than desirable to support their configuration in libvirt. In the 0.7.5 release that will be in Fedora 13, there is finally a comprehensive mechanism for controlling guest CPUs.

Learning about the host CPU model

If you have been following earlier articles (or otherwise know a bit about libvirt) you’ll know that the “virsh capabilities” command displays an XML document describing the capabilities of the hypervisor connection & host. It should thus come as no surprise that this XML schema has been extended to provide information about the host CPU model. One of the big challenges in describing CPU models is that every architecture has a different approach to exposing its capabilities. On x86, a modern CPU’s capabilities are exposed via the CPUID instruction. Essentially this comes down to a set of 32-bit integers with each bit given a specific meaning. Fortunately AMD & Intel agree on common semantics for these bits. VMWare and Xen both expose the notion of CPUID masks directly in their guest configuration format. Unfortunately (or fortunately depending on your POV) QEMU/KVM supports far more than just the x86 architecture, so CPUID is clearly not suitable as the canonical configuration format. QEMU ended up using a scheme which combines a CPU model name string with a set of named flags. On x86 the CPU model maps to a baseline CPUID mask, and the flags can be used to then toggle bits in the mask on or off. libvirt decided to follow this lead and use a combination of a model name and flags. Without further ado, here is an example of what libvirt reports as the capabilities of my laptop’s CPU

# virsh capabilities
<capabilities>

  <host>
    <cpu>
      <arch>i686</arch>
      <model>pentium3</model>
      <topology sockets='1' cores='2' threads='1'/>
      <feature name='lahf_lm'/>
      <feature name='lm'/>
      <feature name='xtpr'/>
      <feature name='cx16'/>
      <feature name='ssse3'/>
      <feature name='tm2'/>
      <feature name='est'/>
      <feature name='vmx'/>
      <feature name='ds_cpl'/>
      <feature name='monitor'/>
      <feature name='pni'/>
      <feature name='pbe'/>
      <feature name='tm'/>
      <feature name='ht'/>
      <feature name='ss'/>
      <feature name='sse2'/>
      <feature name='acpi'/>
      <feature name='ds'/>
      <feature name='clflush'/>
      <feature name='apic'/>
    </cpu>
    ...snip...

It is not practical to have a database listing all known CPU models, so libvirt has a small list of baseline CPU model names. It picks the one that shares the greatest number of CPUID bits with the actual host CPU and then lists the remaining bits as named features. Notice that libvirt does not tell you what features the baseline CPU contains. This might seem like a flaw at first, but as will be shown next, it is not actually necessary to know this information.

Determining a compatible CPU model to suit a pool of hosts

Now that it is possible to find out what CPU capabilities a single host has, the next problem is to determine what CPU capabilities are best to expose to the guest. If it is known that the guest will never need to be migrated to another host, the host CPU model can be passed straight through unmodified. Some lucky people might have a virtualized data center where they can guarantee all servers will have 100% identical CPUs. Again the host CPU model can be passed straight through unmodified. The interesting case though, is where there is variation in CPUs between hosts. In this case the lowest common denominator CPU must be determined. This is not entirely straightforward, so libvirt provides an API for exactly this task. Provide libvirt with a list of XML documents, each describing a host’s CPU model, and it will internally convert these to CPUID masks, calculate their intersection, and finally convert the resulting CPUID mask back into an XML CPU description. Taking the CPU description from a random server

<capabilities>
  <host>
    <cpu>
      <arch>x86_64</arch>
      <model>phenom</model>
      <topology sockets='2' cores='4' threads='1'/>
      <feature name='osvw'/>
      <feature name='3dnowprefetch'/>
      <feature name='misalignsse'/>
      <feature name='sse4a'/>
      <feature name='abm'/>
      <feature name='cr8legacy'/>
      <feature name='extapic'/>
      <feature name='cmp_legacy'/>
      <feature name='lahf_lm'/>
      <feature name='rdtscp'/>
      <feature name='pdpe1gb'/>
      <feature name='popcnt'/>
      <feature name='cx16'/>
      <feature name='ht'/>
      <feature name='vme'/>
    </cpu>
    ...snip...

As a quick check, it is possible to ask libvirt whether this CPU description is compatible with the previous laptop CPU description, using the “virsh cpu-compare” command

$ ./tools/virsh cpu-compare cpu-server.xml
CPU described in cpu-server.xml is incompatible with host CPU

libvirt is correctly reporting that the CPUs are incompatible, because there are several features in the laptop CPU that are missing in the server CPU. To be able to migrate between the laptop and the server, it will be necessary to mask out some features, but which ones? Again libvirt provides an API for this, also exposed via the “virsh cpu-baseline” command

# virsh cpu-baseline both-cpus.xml
<cpu match='exact'>
  <model>pentium3</model>
  <feature policy='require' name='lahf_lm'/>
  <feature policy='require' name='lm'/>
  <feature policy='require' name='cx16'/>
  <feature policy='require' name='monitor'/>
  <feature policy='require' name='pni'/>
  <feature policy='require' name='ht'/>
  <feature policy='require' name='sse2'/>
  <feature policy='require' name='clflush'/>
  <feature policy='require' name='apic'/>
</cpu>

libvirt has determined that in order to safely migrate a guest between the laptop and the server, it is necessary to mask out 11 features from the laptop’s XML description.
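
For reference, the input files used in these commands are simply the <cpu> fragments copied out of each host’s “virsh capabilities” output: cpu-server.xml contains the server’s <cpu> element on its own, while both-cpus.xml contains both host descriptions. The exact layout below is an assumption for illustration; the <cpus> wrapper element is arbitrary, and presumably only the <cpu> elements inside it matter

<cpus>
  <cpu>
    <arch>i686</arch>
    <model>pentium3</model>
    <feature name='lahf_lm'/>
    ...snip remaining laptop features...
  </cpu>
  <cpu>
    <arch>x86_64</arch>
    <model>phenom</model>
    <feature name='osvw'/>
    ...snip remaining server features...
  </cpu>
</cpus>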

Configuring the guest CPU model

To simplify life, the guest CPU configuration accepts the same basic XML representation as the host capabilities XML exposes. In other words, the XML from that “cpu-baseline” virsh command can now be copied directly into the guest XML at the top level under the <domain> element. As the observant reader will have noticed from that last XML snippet, there are a few extra attributes available when describing a CPU in the guest XML. These can mostly be ignored, but for the curious here’s a quick description of what they do. The top level <cpu> element gets an attribute called “match” with possible values

  • match=”minimum” – the host CPU must have at least the CPU features described in the guest XML. If the host has additional features beyond the guest configuration, these will also be exposed to the guest
  • match=”exact” – the host CPU must have at least the CPU features described in the guest XML. If the host has additional features beyond the guest configuration, these will be masked out from the guest
  • match=”strict” – the host CPU must have exactly the same CPU features described in the guest XML. No more, no less.

The next enhancement is that the <feature> elements can each have an extra “policy” attribute with possible values

  • policy=”force” – expose the feature to the guest even if the host does not have it. This is kind of crazy, except in the case of software emulation.
  • policy=”require” – expose the feature to the guest and fail if the host does not have it. This is the sensible default.
  • policy=”optional” – expose the feature to the guest if the host happens to support it.
  • policy=”disable” – if the host has this feature, then hide it from the guest
  • policy=”forbid” – if the host has this feature, then fail and refuse to start the guest

The “forbid” policy is for a niche scenario where a badly behaved application will try to use a feature even if it is not in the CPUID mask, and you wish to prevent accidentally running the guest on a host with that feature. The “optional” policy has special behaviour with respect to migration. When the guest is initially started the flag is optional, but when the guest is live migrated, this policy turns into “require”, since you can’t have features disappearing across migration.
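
Putting this together, here is a rough sketch of how the baseline CPU computed earlier could be dropped into a guest configuration. The domain name is made up, and the “disable” rule is included purely to illustrate mixing policies

<domain type='kvm'>
  <name>demo-guest</name>
  ...snip...
  <cpu match='exact'>
    <model>pentium3</model>
    <feature policy='require' name='lahf_lm'/>
    <feature policy='require' name='sse2'/>
    <!-- hypothetical: hide a host feature from this guest -->
    <feature policy='disable' name='vmx'/>
    ...snip other required features...
  </cpu>
  ...snip...
</domain>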

All the stuff described in this posting is currently implemented for libvirt’s QEMU/KVM driver, with the basic code in the 0.7.5/0.7.6 releases, and the final ‘cpu-baseline’ stuff arriving in 0.7.7. Needless to say this will all be available in Fedora 13 and future RHEL. This obviously also needs to be ported over to the Xen and VMWare ESX drivers in libvirt, which isn’t as hard as it sounds, because libvirt now has a very good internal API for processing CPUID masks. Kudos to Jiri Denemark for doing all the really hard work on this CPU modelling system!

Stable guest machine ABI, PCI addressing and disk controllers in libvirt

Posted: February 15th, 2010 | Filed under: libvirt, Virt Tools

If you are using a physical machine, the chances are that the only things you hotplug are USB devices, or more occasionally SCSI disk drives, never PCI devices themselves. Although you might flash upgrade the BIOS, doing so is not all that likely to change the machine ABI seen by the operating system. The world of virtual machines though, is not quite so static. It is common for administrators to want to hotplug CPUs, memory, USB devices, PCI devices & disk drives. Upgrades of the underlying virtualization software will also bring in new features & capabilities to existing configured devices in the virtual machine, potentially changing the machine ABI seen by the guest OS. Some operating systems (Linux!) cope with these kinds of changes without batting an eyelid. Other operating systems (Windows!) get very upset and decide you have to re-activate the license keys for your operating system, or re-configure your devices.

Stable machine ABIs

The first problem we addressed in libvirt was that of providing a stable machine ABI, so that when you upgrade QEMU/KVM, the guest doesn’t unexpectedly break. QEMU has always had the concept of a “machine type”. In the x86 emulator this lets you switch between a plain old boring ISA-only PC, and a newer ISA+PCI enabled PC. The libvirt capabilities XML format exposes the allowed machine types for each guest architecture

<guest>
  <os_type>hvm</os_type>
  <arch name='i686'>
    <wordsize>32</wordsize>
    <emulator>/usr/bin/qemu</emulator>
    <machine>pc</machine>
    <machine>isapc</machine>
    <domain type='qemu'>
    </domain>
  </arch>
</guest>

This tells you that the i686 emulator with path /usr/bin/qemu supports two machine types “pc” and “isapc”. This information then appears in the guest XML

<domain type='kvm'>
  <name>plain</name>
  <uuid>c7a1edbd-edaf-9455-926a-d65c16db1809</uuid>
  <os>
    <type arch='i686' machine='pc'>hvm</type>
    ...snip...
  </os>
  ...snip...
</domain>

To support a stable machine ABI, the QEMU developers introduced the idea of versioned machine types. Instead of just “pc”, there is now “pc-0.10”, “pc-0.11”, etc., with new versions being added for each QEMU release that changes something in the machine type. The original “pc” machine type is declared to be an alias pointing to the latest version. libvirt captures this information in the capabilities XML, by listing all machine types and using the “canonical” attribute to link up the alias

<guest>
 <os_type>hvm</os_type>
 <arch name='i686'>
   <wordsize>32</wordsize>
   <emulator>/usr/bin/qemu</emulator>
   <machine>pc-0.11</machine>
   <machine canonical='pc-0.11'>pc</machine>
   <machine>pc-0.10</machine>
   <machine>isapc</machine>
   <domain type='qemu'>
   </domain>
 </arch>
</guest>

Comparing to the earlier XML snippet, you can see 2 new machine types “pc-0.10” and “pc-0.11”, and can determine that “pc” is an alias for “pc-0.11”. The next clever part is that when you define / create a new guest in libvirt, if you specify a machine type of “pc”, then libvirt will automatically canonicalize this to “pc-0.11” for you. This means that most of the time an application developer need never worry about this; they will automatically get a stable, versioned machine ABI for all their guests.
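
As a small, hypothetical illustration of the canonicalization (assuming the capabilities shown above), an application can submit the alias and read the versioned name back with “virsh dumpxml”

    <!-- machine type as submitted by the application -->
    <type arch='i686' machine='pc'>hvm</type>

    <!-- machine type as subsequently reported by "virsh dumpxml" -->
    <type arch='i686' machine='pc-0.11'>hvm</type>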

PCI device addressing

The second problem faced is that of ensuring that a device’s PCI address does not randomly change across reboots, or even across host migrations. This problem was primarily caused by the fact that you can hotplug/unplug arbitrary PCI devices at runtime. As an example, a guest might boot with 2 virtio disks and 1 NIC, a total of 3 PCI devices. These get assigned PCI slots 4, 5 and 6 respectively. The admin then unplugs one of the virtio disks. When the guest then reboots, or migrates to another host, QEMU will assign the remaining two devices PCI slots 4 & 5. In other words the NIC has unexpectedly moved from slot 6 to slot 5. Migration is unlikely to be successful when this happens! The first roadblock in attempting to solve this problem was that QEMU did not provide any way for a management application to specify a PCI address for devices. As of the QEMU 0.12 release though, this limitation is finally removed with the introduction of a new generic ‘-device’ argument for configuring virtual hardware devices. QEMU only supports a single PCI bus and no bridges, so we can only configure the PCI slot number at this time, but this is sufficient. As part of a giant patch series I switched libvirt over to use this new syntax.
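
Under the hood this means libvirt now generates explicit slot assignments on the QEMU command line. The sketch below shows the rough shape of the generated arguments; the device and property names are written from memory as an illustration, not copied from libvirt’s actual output

/usr/bin/qemu-kvm ...snip... \
    -device virtio-blk-pci,drive=drive-virtio-disk0,id=virtio-disk0,bus=pci.0,addr=0x4 \
    -device e1000,netdev=hostnet0,id=net0,bus=pci.0,addr=0x6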

In implementing this, it was necessary to add a way to record the PCI addresses in the libvirt guest XML format. After a little head scratching we settled on adding a generic “address” element to every single device in the libvirt XML. This example shows what it looks like in the context of a NIC definition

    <interface type='network'>
      <mac address='52:54:00:f7:c5:0e'/>
      <source network='default'/>
      <target dev='vnet0'/>
      <model type='e1000'/>
      <alias name='e1000-nic2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </interface>

Requiring every management application to assign PCI addresses to devices was not at all desirable though; they should not need to think about this kind of thing. Thus whenever a new guest is defined in libvirt with the QEMU/KVM driver, the first thing libvirt does is to issue unique PCI addresses to every single device in the XML. This neatly side-steps the minor problem of having to tell apps that PCI slot 1 is reserved for the PIIX3, slot 2 for the VGA controller and slot 3 for the balloon device. So an application defines a guest without addresses and then can query libvirt for the updated XML to discover what addresses were assigned. For older QEMU < 0.12, we can’t support static PCI addressing due to lack of the “-device” argument, but libvirt will still report on what addresses were actually assigned by QEMU at runtime.
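
So the typical flow, sketched here with a hypothetical guest name, is to define the guest from an XML file containing no <address> elements at all, then dump the XML back out to see which PCI slots libvirt picked

# virsh define demo-guest.xml
# virsh dumpxml demo-guest | grep "address type='pci'"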

At the time of hotplugging a device, the same principles are applied. The application passes in the device XML without any address info and libvirt assigns a PCI slot. The only minor caveat is that if an application then invokes the “define” operation to make the new device persistent, it should take care to copy across the auto-assigned address. This limitation will be addressed in the near future with an improved libvirt hotplug API.
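
The hotplug case looks much the same, again sketched with made-up names: the XML handed to “attach-device” omits the <address> element, libvirt picks a slot for the live device, and that slot should then be copied into the persistent XML before re-defining the guest

# virsh attach-device demo-guest new-nic.xml
# virsh dumpxml demo-guest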

Disk controllers & drive addressing

At around the same time that I was looking at static PCI addressing, Wolfgang Mauerer was proposing a way to fix the SCSI disk hotplug support in libvirt. The problem here was that at boot time, if you listed 4 SCSI disks, you would get 1 SCSI controller with 4 SCSI disks attached. Meanwhile if you tried hotplugging 4 SCSI disks, you’d get 4 SCSI controllers each with 1 disk attached. In other words we were faking SCSI disk hotplug by attaching entirely new SCSI controllers. It doesn’t take a genius to work out that this is going to crash & burn at the next reboot, or migration, when those hotplugged SCSI controllers all disappear and the disks end up back on a single controller. The solution here was to explicitly model the idea of a disk controller in the libvirt guest XML, separate from the disk itself. For this task, we invented a fairly simple new XML syntax that looks like this

    <controller type='scsi' index='0'>
      <alias name='scsi1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0c' function='0x0'/>
    </controller>

Each additional SCSI controller defined would get a new “index” value, and of course has a unique auto-assigned PCI address too. Notice how the “address” element has a type attribute. This bit of forward planning allows libvirt to introduce new address types later. This capability is now used to link disks to controllers, by defining a “drive” address type, consisting of a controller ID, bus ID and unit ID. In this example, the SCSI disk “sdz” is linked to controller 3, attached as unit 4.

    <disk type='file' device='disk'>
      <source file='/home/berrange/output.img'/>
      <target dev='sdz' bus='scsi'/>
      <alias name='scsi3-0-4'/>
      <address type='drive' controller='3' bus='0' unit='4'/>
    </disk>

Since no existing libvirt application knows about the “controller” or “address” elements, libvirt will automatically populate these as required. So in this example, if the SCSI disk were specified without any “address” element, libvirt will add one, figuring out the controller & unit properties based on the device name “sdz”. Each SCSI controller can have 7 units attached, and “sdz” is the 26th drive, i.e. index 25 counting from zero, hence ending up on controller 3 (25 / 7) and unit 4 (25 % 7). Once we added controllers and drive addresses for SCSI disks, naturally the same was also done for IDE and floppy disks. Virtio disks are a little special in that there is no separate VirtIO disk controller; every VirtIO disk is a separate PCI device. This is somewhat of a scalability problem, since QEMU only has 31 PCI slots and 3 of those are already taken up with the PIIX3, VGA adapter & balloon driver. Needless to say, we hope that QEMU will implement a PCI bridge soon, or introduce a real VirtIO disk controller.
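
For reference, here is a hedged sketch of how the same scheme might look for an IDE disk versus a VirtIO disk; the device names, source files and address values are illustrative assumptions rather than output copied from libvirt

    <!-- hypothetical IDE disk "hdb": unit 1 on bus 0 of the implicit IDE controller -->
    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/images/data.img'/>
      <target dev='hdb' bus='ide'/>
      <address type='drive' controller='0' bus='0' unit='1'/>
    </disk>

    <!-- hypothetical VirtIO disk: no disk controller at all, just its own PCI slot -->
    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/images/scratch.img'/>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </disk>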