If you are using a physical machine, the chances are that the only things you hotplug are USB devices, or more occasionally SCSI disk drives, never PCI devices themselves. Although you might flash upgrade the BIOS, doing so is not all that likely to change the machine ABI seen by the operating system. The world of virtual machines though, is not quite so static. It is common for administrators to want to hotplug CPUs, memory, USB devices, PCI devices & disk drives. Upgrades of the underlying virtualization software will also bring in new features & capabilities to existing configured devices in the virtual machine, potentially changing the machine ABI seen by the guest OS. Some operating systems (Linux !) cope with these kind of changes without batting an eyelid. Other operating systems (Windows !) get very upset and decide you have to re-activate the license keys for your operating system, or re-configure your devices.
Stable machine ABIs
The first problem we addressed in libvirt was that of providing a stable machine ABI, so that when you upgrade QEMU/KVM, the guest doesn’t unexpectedly break. QEMU has always had the concept of a “machine type”. In the x86 emulator this let you switch between a plain old boring ISA only PC, and a new ISA+PCI enabled PC. The libvirt capabilities XML format exposed the allowed machine types for each guest architecture
<guest> <os_type>hvm</os_type> <arch name='i686'> <wordsize>32</wordsize> <emulator>/usr/bin/qemu</emulator> <machine>pc</machine> <machine>isapc</machine> <domain type='qemu'> </domain> </arch> </guest>
This tells you that the i686 emulator with path /usr/bin/qemu supports two machine types “pc” and “isapc”. This information then appears in the guest XML
<domain type='kvm'> <name>plain</name> <uuid>c7a1edbd-edaf-9455-926a-d65c16db1809</uuid> <os> <type arch='i686' machine='pc'>hvm</type> ...snip... </os> ...snip... </domain>
To support a stable machine ABI, the QEMU developers introduced the idea of versioned machine types. Instead of just “pc”, there is now “pc-0.10”, “pc-0.11”, etc new versions being added for each QEMU release that changes something in the machine type. The original “pc” machine type is declared to be an alias pointing to the latest version. libvirt captures this information in the capabilities XML, by listing all machine types and using the “canonical” attribute to link up the alias
<guest> <os_type>hvm</os_type> <arch name='i686'> <wordsize>32</wordsize> <emulator>/usr/bin/qemu</emulator> <machine>pc-0.11</machine> <machine canonical='pc-0.11'>pc</machine> <machine>pc-0.10</machine> <machine>isapc</machine> <domain type='qemu'> </domain> </guest>
Comparing to the earlier XML snippet, you can see 2 new machine types “pc-0.10” and “pc-0.11”, and can determine that “pc” is an alias for “pc-0.11”. The next clever part is that when you define / create a new guest in libvirt, if you specify a machine type of “pc”, then libvirt will automatically canonicalize this to “pc-0.11” for you. This means an application developer never need worry about this most of the time, they will automatically get a stable versioned machine ABI for all their guests
PCI device addressing
The second problem faced is that of ensuring that a device’s PCI address does not randomly change across reboots, or even across host migrations. This problem was primarily caused by the fact that you can hotplug/unplug arbitrary PCI devices at runtime. As an example, a guest might boot with 2 disks disks and 1 NIC, a total of 3 PCI devices. These get assigned PCI slots 4, 5 and 6 respectively. The admin then unplugs one of virtio disk. When the guest then reboots, or migrates to another host, QEMU will assign PCI slots 4 & 5. In other words the NIC has unexpectedly moved from slot 6 to slot 5. Migration is unlikely to be successful when this happens! The first roadblock in attempting to solve this problem was that QEMU did not provide any way for a management application to specify a PCI address for devices. As of the QEMU 0.12 release though, this limitation is finally removed with the introduction of a new generic ‘-device’ argument for configuring virtual hardware devices. QEMU only supports a single PCI bus and no bridges, so we can only configure the PCI slot number at this time, but this is sufficient. As part of a giant patch series I switched libvirt over to use this new syntax.
In implementing this, it was neccessary to add a way to record the PCI addresses in the libvirt guest XML format. After a little head scratching we settled on adding a generic “address” element to every single device in the libvirt XML. This example shows what it looks like in the context of a NIC definition
<interface type='network'> <mac address='52:54:00:f7:c5:0e'/> <source network='default'/> <target dev='vnet0'/> <model type='e1000'/> <alias name='e1000-nic2'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </interface>
Requiring every management application to assign PCI addresses to devices was not at all desirable though, they should not need to think about this kind of thing. Thus whenever a new guest is defined in libvirt with the QEMU/KVM driver, the first thing libvirt does is to issue unique PCI addresses to every single device in the XML. This neatly side-steps the minor problem of having to tell apps that PCI slot 1 is reserved for the PIIX3, slot 2 for the VGA controller and slot 3 for the balloon device. So an application defines a guest without addresses and then can query libvirt for the updated XML to discover what addresses were assigned. For older QEMU < 0.12, we can’t support static PCI addressing due to lack of the “-device” argument, but libvirt will still report on what addresses were actually assigned by QEMU at runtime.
At time of hotplugging a device, the same principals are applied. The application passes in the device XML without any address info and libvirt assigns a PCI slot. The only minor caveat is that if an application then invokes the “define” operation to make the new device persistent, it should take care to copy across the auto-assigned address. This limitation will be addressed in the near future with an improved libvirt hotplug API.
Disk controllers & drive addressing
At around the same time that I was looking at static PCI addressing, Wolfgang Mauerer, was proposing a way to fix the SCSI disk hotplug support in libvirt. The problem here was that at boot time, if you listed 4 SCSI disks, you would get 1 SCSI controller with 4 SCSI disks attached. Meanwhile if you tried hotplugging 4 SCSI disks, you’d get 4 SCSI controllers each with 1 disk attached. In other words we were faking SCSI disk hotplug, by attaching entirely new SCSI controllers. It doesn’t take a genius to work out that this is going to crash & burn at next reboot, or migration, when those hotplugged SCSI controllers all disappear and the disks end up back on a single controller. The solution here was to explicitly model the idea of a disk controller in the libvirt guest XML, separate from the disk itself. For this task, we invented a fairly simple new XML syntax that looks like this
<controller type='scsi' index='0'> <alias name='scsi1'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x0c' function='0x0'/> </controller>
Each additional SCSI controller defined would get a new “index” value , and of course has a unique auto-assigned PCI address too. Notice how the “address” element has a type attribute. This bit of forward planning allows libvirt to introduce new address types later. This capability is now used to link disks to controllers, but defining a “drive” address type, consisting of a controller ID, bus ID and unit ID. In this example, the SCSI disk “sdz” is linked to controller 3, attached as unit 4.
<disk type='file' device='disk'> <source file='/home/berrange/output.img'/> <target dev='sdz' bus='scsi'/> <alias name='scsi3-0-4'/> <address type='drive' controller='3' bus='0' unit='4'/> </disk>
Since no existing libvirt application knows about the “controller” or “address” elements, libvirt will automatically populate these as required. So in this example, if the SCSI disk were specified without any “address” element, libvirt will add one, figuring out controller & unit properaties based onthe device name “sdz”. Each SCSI controller can have 7 units attached, and “sdz” corresponds to drive number 26, hence ending up controller 3 (26 / 7) and unit 4 ((26 % 7)-1). Once we added controllers and drive addresses for SCSI disks, naturally the same was also done for IDE and floppy disks. Virtio disks are a little special in that there is no separate VirtIO disk controller, every VirtIO disk is a separate PCI device. This is somewhat of a scalability problem, since QEMU only has 31 PCI slots and 3 of those are already taken up with the PIIX3, VGA adapter & balloon driver. Needless to say, we hope that QEMU will implement a PCI bridge soon, or introduce a real VirtIO disk controller.