Setting up a nested KVM guest for developing & testing PCI device assignment with NUMA

Posted: February 16th, 2017 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Virt Tools

Over the past few years the OpenStack Nova project has gained support for managing VM usage of NUMA, huge pages and PCI device assignment. One of the more challenging aspects of this is the availability of hardware to develop and test against. In an ideal world it would be possible to emulate everything we need using KVM, enabling developers / test infrastructure to exercise the code without needing access to bare metal hardware supporting these features. KVM has long had support for emulating NUMA topology in guests, and the guest OS can use huge pages inside the guest. What was missing were pieces around PCI device assignment, namely IOMMU support and the ability to associate NUMA nodes with PCI devices. Coincidentally a QEMU community member was already working on providing emulation of the Intel IOMMU. I made a request to the Red Hat KVM team to fill in the other missing gap related to NUMA / PCI device association. To do this required writing code to emulate a PCI/PCI-E Expander Bridge (PXB) device, which provides a lightweight host bridge that can be associated with a NUMA node. Individual PCI devices are then attached to this PXB instead of the main PCI host bridge, thus gaining affinity with a NUMA node.

With this, it is now possible to configure a KVM guest such that it can be used as a virtual host to test NUMA, huge page and PCI device assignment integration. The only real outstanding gap is support for emulating some kind of SRIOV network device, but even without this it is still possible to test most of the Nova PCI device assignment logic – we’re merely restricted to using physical functions, not virtual functions. This blog post describes how to configure such a virtual host.

First of all, this requires very new libvirt & QEMU to work; specifically you’ll want libvirt >= 2.3.0 and QEMU >= 2.7.0. We could technically support earlier QEMU versions too, but that is pending a patch to libvirt to deal with some command line syntax differences in older QEMU versions. No currently released Fedora has new enough packages available, so even on Fedora 25, you must enable the “Virtualization Preview” repository on the physical host to try this out – F25 has a new enough QEMU, so you just need a libvirt update.

# curl --output /etc/yum.repos.d/fedora-virt-preview.repo https://fedorapeople.org/groups/virt/virt-preview/fedora-virt-preview.repo
# dnf upgrade

For the sake of illustration I’m using Fedora 25 as the OS inside the virtual guest, but any other Linux OS will do just fine. The initial task is to install a guest with 8 GB of RAM & 8 CPUs using virt-install

# cd /var/lib/libvirt/images
# wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
# virt-install --name f25x86_64  \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
    --cdrom f25x86_64-boot.iso --os-variant fedora23 \
    --ram 8000 --vcpus 8 \
    ...

The guest needs to use host CPU passthrough, to ensure the guest gets to see VMX as well as other modern instructions, and needs 3 virtual NUMA nodes. The first guest NUMA node will have 4 CPUs and 4 GB of RAM, while the second and third NUMA nodes will each have 2 CPUs and 2 GB of RAM. We are just going to let the guest float freely across host NUMA nodes since we don’t care about performance for dev/test, but in production you would certainly pin each guest NUMA node to a distinct host NUMA node – a sketch of what that pinning would look like follows the fragment below.

    ...
    --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
               cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
               cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
    ...
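For reference, on a production host the pinning would be expressed in the guest XML with a <numatune> block (plus matching <cputune> vCPU pinning). A minimal sketch, assuming the host itself has at least three NUMA nodes to pin against:

<numatune>
  <memnode cellid='0' mode='strict' nodeset='0'/>
  <memnode cellid='1' mode='strict' nodeset='1'/>
  <memnode cellid='2' mode='strict' nodeset='2'/>
</numatune>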

QEMU emulates various different chipsets, and historically for x86 the default has been to emulate the ancient PIIX4 (it is 20+ years old, dating from circa 1995). Unfortunately this is too ancient to be able to use the Intel IOMMU emulation with, so it is necessary to tell QEMU to emulate the marginally less ancient chipset Q35 (it is only 9 years old, dating from 2007).

    ...
    --machine q35

The complete virt-install command line thus looks like

# virt-install --name f25x86_64  \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
    --cdrom f25x86_64-boot.iso --os-variant fedora23 \
    --ram 8000 --vcpus 8 \
    --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
               cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
               cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
    --machine q35

Once the installation is completed, shut down this guest, since it will be necessary to make a number of changes to the guest XML configuration, using “virsh edit“, to enable features that virt-install does not know about. With the use of Q35, the guest XML should initially show three PCI controllers present: a “pcie-root”, a “dmi-to-pci-bridge” and a “pci-bridge”

<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1' model='dmi-to-pci-bridge'>
  <model name='i82801b11-bridge'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
</controller>
<controller type='pci' index='2' model='pci-bridge'>
  <model name='pci-bridge'/>
  <target chassisNr='2'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</controller>

PCI endpoint devices are not themselves associated with NUMA nodes, rather the bus they are connected to has affinity. The default pcie-root is not associated with any NUMA node, but extra PCI-E Expander Bridge controllers can be added and associated with a NUMA node. So while in edit mode, add the following to the XML config

<controller type='pci' index='3' model='pcie-expander-bus'>
  <target busNr='180'>
    <node>0</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='4' model='pcie-expander-bus'>
  <target busNr='200'>
    <node>1</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</controller>
<controller type='pci' index='5' model='pcie-expander-bus'>
  <target busNr='220'>
    <node>2</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</controller>

It is not possible to plug PCI endpoint devices directly into the PXB, so the next step is to add PCI-E root ports into each PXB – we’ll need one port per device to be added, so 9 ports in total. This is where the requirement for libvirt >= 2.3.0 comes from – earlier versions mistakenly prevented you from adding more than one root port to the PXB

<controller type='pci' index='6' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='6' port='0x0'/>
  <alias name='pci.6'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='7' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='7' port='0x8'/>
  <alias name='pci.7'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='8' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='8' port='0x10'/>
  <alias name='pci.8'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='9' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='9' port='0x0'/>
  <alias name='pci.9'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='10' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='10' port='0x8'/>
  <alias name='pci.10'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='11' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='11' port='0x10'/>
  <alias name='pci.11'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='12' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='12' port='0x0'/>
  <alias name='pci.12'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='13' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='13' port='0x8'/>
  <alias name='pci.13'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='14' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='14' port='0x10'/>
  <alias name='pci.14'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x02' function='0x0'/>
</controller>

Notice that the value of the ‘bus‘ attribute on the <address> element matches the value of the ‘index‘ attribute on the <controller> element of the parent device in the topology. The PCI controller topology now looks like this

pcie-root (index == 0)
  |
  +- dmi-to-pci-bridge (index == 1)
  |    |
  |    +- pci-bridge (index == 2)
  |
  +- pcie-expander-bus (index == 3, numa node == 0)
  |    |
  |    +- pcie-root-port (index == 6)
  |    +- pcie-root-port (index == 7)
  |    +- pcie-root-port (index == 8)
  |
  +- pcie-expander-bus (index == 4, numa node == 1)
  |    |
  |    +- pcie-root-port (index == 9)
  |    +- pcie-root-port (index == 10)
  |    +- pcie-root-port (index == 11)
  |
  +- pcie-expander-bus (index == 5, numa node == 2)
       |
       +- pcie-root-port (index == 12)
       +- pcie-root-port (index == 13)
       +- pcie-root-port (index == 14)

All the existing devices are attached to the “pci-bridge” (the controller with index == 2). The devices we intend to use for PCI device assignment inside the virtual host will be attached to the new “pcie-root-port” controllers. We will provide 3 e1000e NICs per NUMA node, so that’s 9 devices in total to add

<interface type='user'>
  <mac address='52:54:00:7e:6e:c6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:c7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:c8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
</interface>

Note that we’re using “user” networking, aka SLIRP. Normally one would never want to use SLIRP, but we don’t care about actually sending traffic over these NICs, and using SLIRP avoids polluting our real host with countless TAP devices.

The final configuration change is to simply add the Intel IOMMU device

<iommu model='intel'/>

It is a capability integrated into the chipset, so it does not need any <address> element of its own. At this point, save the config and start the guest once more. Use the “virsh domifaddr” command to discover the IP address of the guest’s primary NIC, and ssh into it.

# virsh domifaddr f25x86_64
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet0      52:54:00:10:26:7e    ipv4         192.168.122.3/24

# ssh root@192.168.122.3

We can now do some sanity checks to confirm that everything visible in the guest matches what was enabled in the libvirt XML config on the host. For example, confirm the NUMA topology shows 3 nodes

# dnf install numactl
# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3
node 0 size: 3856 MB
node 0 free: 3730 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1813 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1832 MB
node distances:
node   0   1   2 
  0:  10  20  20 
  1:  20  10  20 
  2:  20  20  10 

Confirm that the PCI topology shows the three PCI-E Expander Bridge devices, each with three NICs attached

# lspci -t -v
-+-[0000:dc]-+-00.0-[dd]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[de]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[df]----00.0  Intel Corporation 82574L Gigabit Network Connection
 +-[0000:c8]-+-00.0-[c9]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[ca]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[cb]----00.0  Intel Corporation 82574L Gigabit Network Connection
 +-[0000:b4]-+-00.0-[b5]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[b6]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[b7]----00.0  Intel Corporation 82574L Gigabit Network Connection
 \-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
             +-01.0  Red Hat, Inc. QXL paravirtual graphic card
             +-02.0  Red Hat, Inc. Device 000b
             +-03.0  Red Hat, Inc. Device 000b
             +-04.0  Red Hat, Inc. Device 000b
             +-1d.0  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
             +-1d.1  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
             +-1d.2  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
             +-1d.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
             +-1e.0-[01-02]----01.0-[02]--+-01.0  Red Hat, Inc Virtio network device
             |                            +-02.0  Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller
             |                            +-03.0  Red Hat, Inc Virtio console
             |                            +-04.0  Red Hat, Inc Virtio block device
             |                            \-05.0  Red Hat, Inc Virtio memory balloon
             +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
             +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
             \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller

The IOMMU support will not be enabled yet, as the kernel defaults to leaving it off. To enable it, we must update the kernel command line parameters via grub.

# vi /etc/default/grub
....append "intel_iommu=on" to the GRUB_CMDLINE_LINUX variable...
# grub2-mkconfig > /etc/grub2.cfg
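The resulting line in /etc/default/grub would look something like this – the other arguments shown are illustrative and will vary from host to host:

GRUB_CMDLINE_LINUX="rd.lvm.lv=fedora/root rhgb quiet intel_iommu=on"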

While the intel-iommu device in QEMU can do interrupt remapping, there is no way to enable that feature via libvirt at this time. So we need to set a hack for vfio

echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > \
  /etc/modprobe.d/vfio.conf

This is also a good time to install libvirt and KVM inside the guest

# dnf groupinstall "Virtualization"
# dnf install libvirt-client
# rm -f /etc/libvirt/qemu/networks/autostart/default.xml

Note we’re disabling the default libvirt network, since it’ll clash with the IP address range used by this guest. An alternative would be to edit the default.xml to change the IP subnet, as sketched below.
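That alternative would look something like this – the subnet values here are arbitrary, anything that doesn’t overlap with 192.168.122.0/24 will do:

# virsh net-edit default
....change the <ip> element to use a different subnet, e.g....
<ip address='192.168.123.1' netmask='255.255.255.0'>
  <dhcp>
    <range start='192.168.123.2' end='192.168.123.254'/>
  </dhcp>
</ip>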

Now reboot the guest. When it comes back up, there should be a /dev/kvm device present in the guest.

# ls -al /dev/kvm
crw-rw-rw-. 1 root kvm 10, 232 Oct  4 12:14 /dev/kvm

If this is not the case, make sure the physical host has nested virtualization enabled for the “kvm-intel” or “kvm-amd” kernel modules.
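On an Intel host, the state of nested support can be checked and enabled along these lines (a sketch – the file name under /etc/modprobe.d is arbitrary, and the module must be reloaded, or the host rebooted, for the option to take effect):

# cat /sys/module/kvm_intel/parameters/nested
N
# echo "options kvm-intel nested=1" > /etc/modprobe.d/kvm-nested.conf
# rmmod kvm_intel
# modprobe kvm_intel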

The IOMMU should have been detected and activated

# dmesg  | grep -i DMAR
[    0.000000] ACPI: DMAR 0x000000007FFE2541 000048 (v01 BOCHS  BXPCDMAR 00000001 BXPC 00000001)
[    0.000000] DMAR: IOMMU enabled
[    0.203737] DMAR: Host address width 39
[    0.203739] DMAR: DRHD base: 0x000000fed90000 flags: 0x1
[    0.203776] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 12008c22260206 ecap f02
[    2.910862] DMAR: No RMRR found
[    2.910863] DMAR: No ATSR found
[    2.914870] DMAR: dmar0: Using Queued invalidation
[    2.914924] DMAR: Setting RMRR:
[    2.914926] DMAR: Prepare 0-16MiB unity mapping for LPC
[    2.915039] DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
[    2.915140] DMAR: Intel(R) Virtualization Technology for Directed I/O

The key message confirming everything is good is the last line there – if that’s missing, something went wrong. Don’t be misled by the earlier “DMAR: IOMMU enabled” line, which merely says the kernel saw the “intel_iommu=on” command line option.

The IOMMU should also have registered the PCI devices into various groups

# dmesg  | grep -i iommu  |grep device
[    2.915212] iommu: Adding device 0000:00:00.0 to group 0
[    2.915226] iommu: Adding device 0000:00:01.0 to group 1
...snip...
[    5.588723] iommu: Adding device 0000:b5:00.0 to group 14
[    5.588737] iommu: Adding device 0000:b6:00.0 to group 15
[    5.588751] iommu: Adding device 0000:b7:00.0 to group 16

Libvirt meanwhile should have detected all the PCI controllers/devices

# virsh nodedev-list --tree
computer
  |
  +- net_lo_00_00_00_00_00_00
  +- pci_0000_00_00_0
  +- pci_0000_00_01_0
  +- pci_0000_00_02_0
  +- pci_0000_00_03_0
  +- pci_0000_00_04_0
  +- pci_0000_00_1d_0
  |   |
  |   +- usb_usb2
  |       |
  |       +- usb_2_0_1_0
  |         
  +- pci_0000_00_1d_1
  |   |
  |   +- usb_usb3
  |       |
  |       +- usb_3_0_1_0
  |         
  +- pci_0000_00_1d_2
  |   |
  |   +- usb_usb4
  |       |
  |       +- usb_4_0_1_0
  |         
  +- pci_0000_00_1d_7
  |   |
  |   +- usb_usb1
  |       |
  |       +- usb_1_0_1_0
  |       +- usb_1_1
  |           |
  |           +- usb_1_1_1_0
  |             
  +- pci_0000_00_1e_0
  |   |
  |   +- pci_0000_01_01_0
  |       |
  |       +- pci_0000_02_01_0
  |       |   |
  |       |   +- net_enp2s1_52_54_00_10_26_7e
  |       |     
  |       +- pci_0000_02_02_0
  |       +- pci_0000_02_03_0
  |       +- pci_0000_02_04_0
  |       +- pci_0000_02_05_0
  |         
  +- pci_0000_00_1f_0
  +- pci_0000_00_1f_2
  |   |
  |   +- scsi_host0
  |   +- scsi_host1
  |   +- scsi_host2
  |   +- scsi_host3
  |   +- scsi_host4
  |   +- scsi_host5
  |     
  +- pci_0000_00_1f_3
  +- pci_0000_b4_00_0
  |   |
  |   +- pci_0000_b5_00_0
  |       |
  |       +- net_enp181s0_52_54_00_7e_6e_c6
  |         
  +- pci_0000_b4_01_0
  |   |
  |   +- pci_0000_b6_00_0
  |       |
  |       +- net_enp182s0_52_54_00_7e_6e_c7
  |         
  +- pci_0000_b4_02_0
  |   |
  |   +- pci_0000_b7_00_0
  |       |
  |       +- net_enp183s0_52_54_00_7e_6e_c8
  |         
  +- pci_0000_c8_00_0
  |   |
  |   +- pci_0000_c9_00_0
  |       |
  |       +- net_enp201s0_52_54_00_7e_6e_d6
  |         
  +- pci_0000_c8_01_0
  |   |
  |   +- pci_0000_ca_00_0
  |       |
  |       +- net_enp202s0_52_54_00_7e_6e_d7
  |         
  +- pci_0000_c8_02_0
  |   |
  |   +- pci_0000_cb_00_0
  |       |
  |       +- net_enp203s0_52_54_00_7e_6e_d8
  |         
  +- pci_0000_dc_00_0
  |   |
  |   +- pci_0000_dd_00_0
  |       |
  |       +- net_enp221s0_52_54_00_7e_6e_e6
  |         
  +- pci_0000_dc_01_0
  |   |
  |   +- pci_0000_de_00_0
  |       |
  |       +- net_enp222s0_52_54_00_7e_6e_e7
  |         
  +- pci_0000_dc_02_0
      |
      +- pci_0000_df_00_0
          |
          +- net_enp223s0_52_54_00_7e_6e_e8

And if you look at a specific PCI device, it should report the NUMA node it is associated with and the IOMMU group it is part of

# virsh nodedev-dumpxml pci_0000_df_00_0
<device>
  <name>pci_0000_df_00_0</name>
  <path>/sys/devices/pci0000:dc/0000:dc:02.0/0000:df:00.0</path>
  <parent>pci_0000_dc_02_0</parent>
  <driver>
    <name>e1000e</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>223</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x10d3'>82574L Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='10'>
      <address domain='0x0000' bus='0xdc' slot='0x02' function='0x0'/>
      <address domain='0x0000' bus='0xdf' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='2'/>
    <pci-express>
      <link validity='cap' port='0' speed='2.5' width='1'/>
      <link validity='sta' speed='2.5' width='1'/>
    </pci-express>
  </capability>
</device>

Finally, libvirt should also be reporting the NUMA topology

# virsh capabilities
...snip...
<topology>
  <cells num='3'>
    <cell id='0'>
      <memory unit='KiB'>4014464</memory>
      <pages unit='KiB' size='4'>1003616</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>0</pages>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='1' value='20'/>
        <sibling id='2' value='20'/>
      </distances>
      <cpus num='4'>
        <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
        <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
        <cpu id='2' socket_id='2' core_id='0' siblings='2'/>
        <cpu id='3' socket_id='3' core_id='0' siblings='3'/>
      </cpus>
    </cell>
    <cell id='1'>
      <memory unit='KiB'>2016808</memory>
      <pages unit='KiB' size='4'>504202</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>0</pages>
      <distances>
        <sibling id='0' value='20'/>
        <sibling id='1' value='10'/>
        <sibling id='2' value='20'/>
      </distances>
      <cpus num='2'>
        <cpu id='4' socket_id='4' core_id='0' siblings='4'/>
        <cpu id='5' socket_id='5' core_id='0' siblings='5'/>
      </cpus>
    </cell>
    <cell id='2'>
      <memory unit='KiB'>2014644</memory>
      <pages unit='KiB' size='4'>503661</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>0</pages>
      <distances>
        <sibling id='0' value='20'/>
        <sibling id='1' value='20'/>
        <sibling id='2' value='10'/>
      </distances>
      <cpus num='2'>
        <cpu id='6' socket_id='6' core_id='0' siblings='6'/>
        <cpu id='7' socket_id='7' core_id='0' siblings='7'/>
      </cpus>
    </cell>
  </cells>
</topology>
...snip...

Everything should be ready and working at this point, so let’s try to install a nested guest, and assign it one of the e1000e PCI devices. For simplicity we’ll just do the exact same install for the nested guest as we used for the top level guest we’re currently running in. The only difference is that we’ll assign it a PCI device

# cd /var/lib/libvirt/images
# wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
# virt-install --name f25x86_64 --ram 2000 --vcpus 8 \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 10 \
    --cdrom f25x86_64-boot.iso --os-variant fedora23 \
    --hostdev pci_0000_df_00_0 --network none

If everything went well, you should now have a nested guest with an assigned PCI device attached to it.
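Inside the nested guest, a quick lspci should confirm the assigned NIC is visible as an ordinary PCI device (the exact address will vary):

# lspci | grep Ethernet
00:03.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection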

This turned out to be a rather long blog posting, but that is not surprising, as we’re experimenting with some cutting edge KVM features, trying to emulate quite a complicated hardware setup that deviates a fair way from the normal KVM guest setup. Perhaps in the future virt-install will be able to simplify some of this, but at least for the short-medium term there’ll be a fair bit of manual work required. The positive thing though is that this has clearly demonstrated that KVM is now advanced enough that you can reasonably expect to do development and testing of features like NUMA and PCI device assignment inside nested guests.

The next step is to convince someone to add QEMU emulation of an Intel SRIOV network device….volunteers please :-)

ANNOUNCE: libosinfo 1.0.0 release

Posted: February 16th, 2017 | Filed under: Fedora, libvirt, OpenStack, Virt Tools

NB, this blog post was intended to be published back in November last year, but got forgotten in draft stage. Publishing now in case anyone missed the release…

I am happy to announce a new release of libosinfo, version 1.0.0 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

Changes in this release include:

  • Update loader to follow new layout for external database
  • Move all database files into separate osinfo-db package
  • Move osinfo-db-validate into osinfo-db-tools package

As promised, this release of libosinfo has completed the separation of the library code from the database files. There are now three independently released artefacts:

  • libosinfo – provides the libosinfo shared library and most associated command line tools
  • osinfo-db – contains only the database XML files and RNG schema, no code at all.
  • osinfo-db-tools – a set of command line tools for managing deployment of osinfo-db archives for vendors & users.

Before installing the 1.0.0 release of libosinfo it is necessary to install osinfo-db-tools, followed by osinfo-db. The download page has instructions for how to deploy the three components. In particular note that ‘osinfo-db’ does NOT contain any traditional build system, as the only files it contains are XML database files. So instead of unpacking the osinfo-db archive, use the osinfo-db-import tool to deploy it.
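For example, deploying a freshly downloaded database archive system-wide would look something like this – the archive name is illustrative, use whichever version you actually downloaded:

# osinfo-db-import --system osinfo-db-20170215.tar.xz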

The surprisingly complicated world of disk image sizes

Posted: February 10th, 2017 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Virt Tools

When managing virtual machines one of the key tasks is to understand the utilization of resources being consumed, whether RAM, CPU, network or storage. This post will examine different aspects of managing storage when using file based disk images, as opposed to block storage. When provisioning a virtual machine the tenant user will have an idea of the amount of storage they wish the guest operating system to see for their virtual disks. This is the easy part. It is simply a matter of telling ‘qemu-img’ (or a similar tool) ’40GB’ and it will create a virtual disk image that is visible to the guest OS as a 40GB volume. The virtualization host administrator, however, doesn’t particularly care about what size the guest OS sees. They are instead interested in how much space is (or will be) consumed in the host filesystem storing the image. With this in mind, there are four key figures to consider when managing storage:

  • Capacity – the size that is visible to the guest OS
  • Length – the current highest byte offset in the file.
  • Allocation – the amount of storage that is currently consumed.
  • Commitment – the amount of storage that could be consumed in the future.

The relationship between these figures will vary according to the format of the disk image file being used. For the sake of illustration, raw and qcow2 files will be compared, since they provide examples of the simplest and the most complicated file formats used for virtual machines.

Raw files

In a raw file, the sectors visible to the guest are mapped 1-to-1 onto sectors in the host file. Thus the capacity and length values will always be identical for raw files – the length dictates the capacity and vice-versa. The allocation value is slightly more complicated. Most filesystems do lazy allocation on blocks, so even if a file is 10 GB in length it is entirely possible for it to consume 0 bytes of physical storage, if nothing has been written to the file yet. Such a file is known as “sparse”, or is said to have “holes” in its allocation. To maximize guest performance, it is common to tell the operating system to fully allocate a file at time of creation, either by writing zeros to every block (very slow) or via a special system call to instruct it to immediately allocate all blocks (very fast). So immediately after creating a new raw file, the allocation would typically either match the length, or be zero. In the latter case, as the guest writes to various disk sectors, the allocation of the raw file will grow. The commitment value refers to the upper bound for the allocation value, and for raw files this will match the length of the file.
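The difference between length and allocation is easy to see from the shell (file names hypothetical). ‘ls -lh’ will report a 40G length for both files below, while ‘du -h’ will report an allocation of 0 for the lazily allocated file and 40G for the preallocated one:

# truncate --size 40G sparse.img
# fallocate --length 40G prealloc.img
# ls -lh sparse.img prealloc.img
# du -h sparse.img prealloc.img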

While raw files look reasonably straightforward, some filesystems can create surprises. XFS has a concept of “speculative preallocation” where it may allocate more blocks than are actually needed to satisfy the current I/O operation. This is useful for files which are progressively growing, since it is faster to allocate 10 blocks all at once than to allocate 10 blocks individually. So while a raw file’s allocation will usually never exceed the length, if XFS has speculatively preallocated extra blocks, it is possible for the allocation to exceed the length. The excess is usually pretty small though – bytes or KBs, not MBs. Btrfs meanwhile has a concept of “copy on write” whereby multiple files can initially share allocated blocks, and when one file is written it will take a private copy of the blocks written. IOW, to determine the usage of a set of files it is not sufficient to sum the allocation of each file, as that would over-count the true allocation due to block sharing.

QCow2 files

In a qcow2 file, the sectors visible to the guest are indirectly mapped to sectors in the host file via a number of lookup tables. A sector at offset 4096 in the guest may be stored at offset 65536 in the host. In order to perform this mapping, there are various auxiliary data structures stored in the qcow2 file. Describing all of these structures is beyond the scope of this post; read the specification instead. The key point is that, unlike raw files, the length of the file in the host has no relation to the capacity seen in the guest. The capacity is determined by a value stored in the file header metadata. By default, the qcow2 file will grow on demand, so the length of the file will gradually grow as more data is stored. It is possible to request preallocation, either just of file metadata, or of the full file payload too. Since the file grows on demand as data is written, traditionally it would never have any holes in it, so the allocation would always match the length (the previous caveat about XFS speculative preallocation still applies though). Since the introduction of SSDs, however, the notion of explicitly cutting holes in files has become commonplace. When this is plumbed through from the guest, a guest initiated TRIM request will in turn create a hole in the qcow2 file, which will also issue a TRIM to the underlying host storage. Thus even though qcow2 files are grow on demand, they may also become sparse over time, so the allocation may be less than the length.

The maximum commitment for a qcow2 file is surprisingly hard to get an accurate answer to. To calculate it requires intimate knowledge of the qcow2 file format and even the type of data stored in it. There is allocation overhead from the data structures used to map guest sectors to host file offsets, which is directly proportional to the capacity and the qcow2 cluster size (a cluster is the qcow2 equivalent of the “sector” concept, except much bigger – 65536 bytes by default). Over time qcow2 has grown other data structures though, such as various bitmap tables tracking cluster allocation and recent writes. With the addition of LUKS support, there will be key data tables. Most significantly though, qcow2 can internally store entire VM snapshots containing the virtual device state, guest RAM and copy-on-write disk sectors. If snapshots are ignored, it is possible to calculate a value for the commitment, and it will be proportional to the capacity. If snapshots are used, however, all bets are off – the amount of storage that can be consumed is unbounded, so there is no commitment value that can be accurately calculated.
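The capacity vs allocation split is easy to inspect with qemu-img – “virtual size” is the capacity and “disk size” the allocation, while a plain ‘ls -l’ reports the length. A sketch, with illustrative numbers and output trimmed:

# qemu-img create -f qcow2 demo.qcow2 40G
# qemu-img info demo.qcow2
image: demo.qcow2
file format: qcow2
virtual size: 40G (42949672960 bytes)
disk size: 196K
cluster_size: 65536

As a rough sanity check on the metadata overhead, with the default 65536 byte clusters the L2 mapping tables need 8 bytes per cluster, i.e. 40GB × 8 / 65536 ≈ 5MB, and the refcount tables (2 bytes per cluster by default) roughly another 1.25MB – which together account for the ~6.5MB metadata-preallocation figure in the table below.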

Summary

Considering the above information, for a newly created file the four size values would look like

Format                     Capacity  Length  Allocation  Commitment
raw (sparse)               40GB      40GB    0           40GB [1]
raw (prealloc)             40GB      40GB    40GB [1]    40GB [1]
qcow2 (grow on demand)     40GB      193KB   196KB       41GB [2]
qcow2 (prealloc metadata)  40GB      41GB    6.5MB       41GB [2]
qcow2 (prealloc all)       40GB      41GB    41GB        41GB [2]

[1] XFS speculative preallocation may cause allocation/commitment to be very slightly higher than 40GB
[2] use of internal snapshots may massively increase allocation/commitment

For an application attempting to manage filesystem storage to ensure any future guest OS write will always succeed without triggering ENOSPC (out of space) in the host, the commitment value is critical to understand. If the length/allocation values are initially less than the commitment, they will grow towards it as the guest writes data. For raw files it is easy to determine commitment (XFS preallocation aside), but for qcow2 files it is unreasonably hard. Even ignoring internal snapshots, there is no API provided by libvirt that reports this value, nor is it exposed by QEMU or its tools. Determining the commitment for a qcow2 file requires the application to not only understand the qcow2 file format, but also directly query the header metadata to read internal parameters such as “cluster size” to be able to then calculate the required value. Without this, the best an application can do is to guess – e.g. add 2% to the capacity of the qcow2 file to determine likely commitment. Snapshots make life even harder, but to be fair, qcow2 internal snapshots are best avoided regardless, in favour of external snapshots. The lack of information around file commitment is a clear gap that needs addressing in both libvirt and QEMU.

That all said, ensuring the sum of commitment values across disk images is within the filesystem free space is only one approach to managing storage. These days QEMU has the ability to live migrate virtual machines even when their disks are on host-local storage – it simply copies across the disk image contents too. So a valid approach is to mostly ignore future commitment implied by disk images, and instead just focus on the near term usage. For example, regularly monitor filesystem usage and if free space drops below some threshold, then migrate one or more VMs (and their disk images) off to another host to free up space for remaining VMs.
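With libvirt, such a storage-copying live migration can be triggered along these lines (domain and destination host names hypothetical):

# virsh migrate --live --copy-storage-all demo qemu+ssh://otherhost/system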

Commenting out XML snippets in libvirt guest config by stashing it as metadata

Posted: February 8th, 2017 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Virt Tools

Libvirt uses XML as the format for configuring objects it manages, including virtual machines. Sometimes when debugging / developing it is desirable to comment out sections of the virtual machine configuration to test some idea. For example, one might want to temporarily remove a secondary disk. It is not always desirable to just delete the configuration entirely, as it may need to be re-added immediately after. XML has support for comments <!-- .... some text --> which one might try to use to achieve this. Using comments in XML fed into libvirt, however, will result in an unwelcome surprise – the commented out text is thrown into /dev/null by libvirt.

This is an unfortunate consequence of the way libvirt handles XML documents. It does not consider the XML document to be the master representation of an object’s configuration – a series of C structs are the actual internal representation. XML is simply a data interchange format for serializing structs into a text format that can be interchanged with the management application, or persisted on disk. So when receiving an XML document libvirt will parse it, extracting the pieces of information it cares about, which are then stored in memory in some structs, while the XML document is discarded (along with the comments it contained). Given this way of working, to preserve comments would require libvirt to add 100’s of extra fields to its internal structs and extract comments from every part of the XML document that might conceivably contain them. This is totally impractical to do in reality. The alternative would be to consider the parsed XML DOM as the canonical internal representation of the config. This is what the libvirt-gconfig library in fact does, but it means you can no longer just do simple field accesses to get at information – getter/setter methods would have to be used, which quickly becomes tedious in C. It would also involve refactoring almost the entire libvirt codebase, so such a change in approach would realistically never be done.

Given that it is not possible to use XML comments in libvirt, what other options might be available?

Many years ago libvirt added the ability to store arbitrary user defined metadata in domain XML documents. The caveat is that they have to be located in a specific place in the XML document, as a child of the <metadata> tag, in a private XML namespace. This metadata facility can be abused as a hack to temporarily stash some XML out of the way. Consider a guest which contains a disk to be “commented out”:

<domain type="kvm">
  ...
  <devices>
    ...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </disk>
    ...
  </devices>
</domain>

To stash the disk config as a piece of metadata requires changing the XML to

<domain type="kvm">
  ...
  <metadata>
    <s:disk xmlns:s="http://stashed.org/1" type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </s:disk>
  </metadata>
  ...
  <devices>
    ...
  </devices>
</domain>

What we have done here is

– Added a <metadata> element at the top level
– Moved the <disk> element to be a child of <metadata> instead of a child of <devices>
– Added an XML namespace to <disk> by giving it an ‘s:’ prefix and associating a URI with this prefix

Libvirt only allows a single top level metadata element per namespace, so if there are multiple things to be stashed, just give them each a custom namespace, or introduce an arbitrary wrapper. Aside from mandating the use of a unique namespace, libvirt treats the metadata as entirely opaque and will not try to interpret or parse it in any way. Any valid XML construct can be stashed in the metadata – even invalid XML constructs, provided they are hidden inside a CDATA block. For example, if you’re using virsh edit to make some changes interactively and want to get out before finishing them, just stash the changes in a CDATA section, avoiding the need to worry about correctly closing the elements.

<domain type="kvm">
  ...
  <metadata>
    <s:stash xmlns:s="http://stashed.org/1">
    <![CDATA[
      <disk type='file' device='disk'>
        <driver name='qemu' type='raw'/>
        <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
        <target dev='vda' bus='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
      </disk>
      <disk>
        <driver name='qemu' type='raw'/>
        ...i'll finish writing this later...
    ]]>
    </s:stash>
  </metadata>
  ...
  <devices>
    ...
  </devices>
</domain>
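As an aside, the same stashing can be done non-interactively with the “virsh metadata” command instead of “virsh edit” – a sketch, using an arbitrary namespace URI and key:

$ virsh metadata demo http://stashed.org/1 --key s \
    --set "<note>second disk config saved in /root/disk.xml</note>"
$ virsh metadata demo http://stashed.org/1
<note>second disk config saved in /root/disk.xml</note>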

Admittedly this is a somewhat cumbersome solution. In most cases it is probably simpler to just save the snippet of XML in a plain text file outside libvirt. This metadata trick, however, might just come in handy sometimes.

As an aside, the real, intended usage of the <metadata> facility is to allow applications which interact with libvirt to store custom data they may wish to associate with the guest. As an example, the recently announced libvirt websockets console proxy uses it to record which consoles are to be exported. I know of few other real world applications using this metadata feature, however, so it is worth remembering it exists :-) System administrators are free to use it for local book keeping purposes too.

ANNOUNCE: new libvirt console proxy project

Posted: January 26th, 2017 | Filed under: Fedora, libvirt, OpenStack, Security

This post is to announce the existence of a new small project to provide a websockets proxy explicitly targeting virtual machine serial consoles and SPICE/VNC graphical displays.

Background

Virtual machines will typically expose one or more consoles for the user to interact on, whether plain text consoles or graphical consoles via VNC/SPICE. With KVM, the network server for these consoles is run as part of the QEMU process on the compute node. It has become common practice to not directly expose these network services to the end user. Instead large scale deployments will typically run some kind of proxy service sitting between the end user and their VM’s console(s). Typically this proxy will tunnel the connection over the websockets protocol too. This has a number of advantages to users and admins alike. By using websockets, it is possible to use an HTTP parameter or cookie to identify which VM to connect to. This means that a single public network port can multiplex access to all the virtual machines, dynamically figuring out which to connect to. The use of websockets also makes it possible to write in-browser clients (noVNC, SPICE-HTML5) to display the console, avoiding the need to install local desktop apps.

There are already quite a few implementations of the websockets proxy idea out there used by virt/cloud management apps, so one might wonder why another one is needed. The existing implementations generally all work at the TCP layer, meaning they treat the content being carried as opaque data. This means that while they provide security between the end user & proxy server by use of HTTPS for the websockets connection, the internal communication between the proxy and QEMU process is still typically running in cleartext. For SPICE and serial consoles it is possible to add security between the proxy and QEMU by simply layering TLS over the entire protocol, so again the data transported can be considered opaque. This isn’t possible for VNC though. To get encryption between the proxy server and QEMU requires interpreting the VNC protocol to intercept the authentication scheme negotiation, turning on TLS support. IOW, the proxy server cannot treat the VNC data stream as opaque.

The libvirt-console-proxy project was started specifically to address this requirement for VNC security. The proxy will interpret the VNC protocol handshake, responding to the client user’s auth request (which is for no auth at the VNC layer, since the websockets layer already handled auth from the client’s POV). The proxy will start a new auth scheme handshake with QEMU and request the VeNCrypt scheme be used, which enables x509 certificate mutual validation and TLS encryption of the data stream. Once the security handshake is complete, the proxy will allow the rest of the VNC protocol to run as a black-box once more – it simply adds TLS encryption when talking to QEMU. This VNC MITM support is already implemented in the libvirt-console-proxy code and enabled by default.

As noted earlier, SPICE is easier to deal with, since it has separate TCP ports for plain vs TLS traffic. The proxy simply needs to connect to the TLS port instead of the TCP port, and does not need to speak the SPICE protocol. There is, however, a different complication with SPICE. A logical SPICE connection from a client to the server actually consists of many network connections, as SPICE uses a separate socket for each type of data (keyboard events, framebuffer updates, audio, usb tunnelling, etc). In fact there can be as many as 10 network connections for a single SPICE connection. The websockets proxy will normally expect some kind of secret token to be provided by the client, both as authentication and to identify which SPICE server to connect to. It can be highly desirable for such tokens to be single-use only. Since each SPICE connection requires multiple TCP connections to be established, the proxy has no way of knowing when it can invalidate further use of the token. To deal with this it needs to insert itself into the SPICE protocol negotiation. This allows it to see the protocol message from the server enumerating which channels are available, and also see the unique session identifier. When the additional SPICE network connections arrive, the proxy can now validate that the session ID matches the previously identified session ID, and reject re-use of the websockets security token if they don’t match. It can also permanently invalidate the security token once all expected secondary connections are established. At time of writing, this SPICE MITM support is not yet implemented in the libvirt-console-proxy code, but it is planned.

Usage

The libvirt-console-proxy project is designed to be reusable by any virtualization management project and so is split into two parts. The general purpose “virtconsoleproxyd” daemon exposes the websockets server and actually runs the connections to the remote QEMU server(s). This daemon is designed to be secure by default, so it mandates configuration of x509 certificates for both the websockets server and the internal VNC, SPICE & serial console connections to QEMU. Running in cleartext requires explicit “--listen-insecure” and “--connect-insecure” flags to make it clear this is a stupid thing to be doing.

The “virtconsoleproxyd” still needs some way to identify which QEMU server(s) it should connect to for a given incoming websockets connection. To address this there is the concept of an external “token resolver” service. This is a daemon running somewhere (either co-located, or on a remote host), providing a trivial REST service. The proxy server does a GET to this external resolver service, passing along the token value. The resolver validates the token, determines which QEMU server it corresponds to, and passes this info back to “virtconsoleproxyd“. This way, each virt management project can deploy the “virtconsoleproxyd” service as is, and simply provide a custom “token resolver” service to integrate with whatever application specific logic is needed to identify QEMU.

There is also a reference implementation of the token resolver interface called “virtconsoleresolverd“. This resolver has a connection to one or more libvirtd daemons and monitors for running virtual machines on each one. When a virtual machine starts, it’ll query the libvirt XML for that machine and extract a specific piece of metadata from the XML under the namespace http://libvirt.org/schemas/console-proxy/1.0. This metadata identifies a libvirt secret object that provides the token for connecting to that guest console. As should be expected, the “virtconsoleresolverd” requires use of HTTPS by default, refusing to run cleartext unless the “--listen-insecure” option is given.

For example, on a guest which has a SPICE graphics console, a serial port and two virtio-console devices, the metadata would look like

$ virsh  metadata demo http://libvirt.org/schemas/console-proxy/1.0
<consoles>
  <console token="999f5742-2fb5-491c-832b-282b3afdfe0c" type="spice" port="0" insecure="no"/>
  <console token="6a92ef00-6f54-4c18-820d-2a2eaf9ac309" type="serial" port="0" insecure="no"/>
  <console token="3d7bbde9-b9eb-4548-a414-d17fa1968aae" type="console" port="0" insecure="no"/>
  <console token="393c6fdd-dbf7-4da9-9ea7-472d2f5ad34c" type="console" port="1" insecure="no"/>
</consoles>

Each of those tokens refers to a libvirt secret

$ virsh -c sys secret-dumpxml 999f5742-2fb5-491c-832b-282b3afdfe0c
<secret ephemeral='no' private='no'>
  <uuid>999f5742-2fb5-491c-832b-282b3afdfe0c</uuid>
  <description>Token for spice console proxy domain d54df46f-1ab5-4a22-8618-4560ef5fac2c</description>
</secret>

And that secret has a value associated which is the actual security token to be provided when connecting to the websockets proxy

$ virsh -c sys secret-get-value 999f5742-2fb5-491c-832b-282b3afdfe0c | base64 -d
750d78ed-8837-4e59-91ce-eef6800227fd

So in this example, to access the SPICE console, the remote client would provide the security token string “750d78ed-8837-4e59-91ce-eef6800227fd“. It happens that I’ve used UUIDs as the actual security tokens, but they can be in any format desired – random UUIDs just happen to be reasonably good high entropy secrets.

Setting up this metadata is a little bit tedious, so there is also a helper program that can be used to automatically create the secrets, set a random value, and write the metadata into the guest XML

$ virtconsoleresolveradm enable demo
Enabled access to domain 'demo'

Or to stop it again

$ virtconsoleresolveradm disable demo
Disabled access to domain 'demo'

If you noticed my previous announcements of libvirt-go and libvirt-go-xml it won’t come as a surprise to learn that this new project is also written in Go. In fact the integration of libvirt support in this console proxy project is what triggered my work on those other two projects. Having previously worked on adding the same kind of VNC MITM protocol support to the OpenStack nova-novncproxy server, I find I’m liking Go more and more compared to Python.