Controlling guest CPU & NUMA affinity in libvirt with QEMU, KVM & Xen

Posted: February 12th, 2010 | Filed under: libvirt, Virt Tools

When provisioning new guests with libvirt, the standard policy for affinity between the guest and host CPUs / NUMA nodes is to have no policy at all. In other words, the guest will follow whatever the hypervisor’s own default policy is, which is usually to run the guest on whatever host CPU is available. There are times when an explicit policy may be better. In particular, to make the most of a NUMA architecture it is usually desirable to lock a guest to a particular NUMA node so that its memory allocations are always local to the node it is running on, avoiding cross-node memory transfers, which have less bandwidth. As of writing, libvirt supports this capability for QEMU, KVM and Xen guests. Even on a non-NUMA system, some form of explicit placement across the host’s sockets, cores & hyperthreads may be desired.

Querying host CPU / NUMA topology

The first step in deciding what policy to apply is to figure out what the host’s topology is. The virsh nodeinfo command provides information about how many sockets, cores & hyperthreads there are on a host.

# virsh nodeinfo
CPU model:           x86_64
CPU(s):              8
CPU frequency:       1000 MHz
CPU socket(s):       2
Core(s) per socket:  4
Thread(s) per core:  1
NUMA cell(s):        1
Memory size:         8179176 kB

There are a total of 8 CPUs, in 2 sockets, each with 4 cores.

More interesting though is the NUMA topology. This can be significantly more complex, so the data is provided in a structured XML document, as part of the virsh capabilities output

# virsh capabilities
<capabilities>

  <host>
    <cpu>
      <arch>x86_64</arch>
    </cpu>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <cpus num='4'>
            <cpu id='0'/>
            <cpu id='1'/>
            <cpu id='2'/>
            <cpu id='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <cpus num='4'>
            <cpu id='4'/>
            <cpu id='5'/>
            <cpu id='6'/>
            <cpu id='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    <secmodel>
      <model>selinux</model>
      <doi>0</doi>
    </secmodel>
  </host>

 ...removed remaining XML...

</capabilities>

This tells us that there are two NUMA nodes (aka cells), each containing 4 logical CPUs. Since we know there are two sockets, we can infer that each socket is in a separate node, not that this really matters for what we need later. If we’re intending to run a guest with 4 virtual CPUs, we can see that it will be desirable to lock the guest to physical CPUs 0-3 or 4-7 to avoid non-local memory accesses. If our guest workload required 8 virtual CPUs, then since each NUMA node only has 4 physical CPUs, better utilization may be obtained by running a pair of 4 CPU guests & splitting the work between them, rather than using a single 8 CPU guest.
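If you would rather extract the node-to-CPU mapping programmatically than read it by eye, the following is a minimal sketch using only the Python standard library. It assumes the XML layout shown in the capabilities output above; the function name numa_cells and the embedded sample are illustrative, not part of libvirt.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <topology> section from the capabilities output above.
CAPABILITIES_XML = """
<capabilities>
  <host>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <cpus num='4'>
            <cpu id='0'/>
            <cpu id='1'/>
            <cpu id='2'/>
            <cpu id='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <cpus num='4'>
            <cpu id='4'/>
            <cpu id='5'/>
            <cpu id='6'/>
            <cpu id='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
  </host>
</capabilities>
"""


def numa_cells(capabilities_xml):
    """Return {cell id: sorted list of host CPU ids} from capabilities XML."""
    root = ET.fromstring(capabilities_xml)
    cells = {}
    for cell in root.findall("./host/topology/cells/cell"):
        cpu_ids = [int(cpu.get("id")) for cpu in cell.findall("./cpus/cpu")]
        cells[int(cell.get("id"))] = sorted(cpu_ids)
    return cells


print(numa_cells(CAPABILITIES_XML))  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```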

Deciding which NUMA node to run the guest on

Locking a guest to a particular NUMA node is rather pointless if that node does not have sufficient free memory to satisfy the guest’s allocations locally. Indeed, it would be very detrimental to utilization. The next step is to ask libvirt what the free memory is on each node, using the virsh freecell command

# virsh freecell 0
0: 2203620 kB

# virsh freecell 1
1: 3354784 kB

If our guest needs to have 3 GB of RAM allocated, then clearly it needs to be run on NUMA node (cell) 1, rather than node 0, since the latter only has 2.2 GB available.
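The same decision can be sketched in a few lines of Python. The free memory figures are the ones reported by virsh freecell above; the function name pick_cell and the "best fit" rule (the smallest cell that still fits the guest) are my own assumptions for illustration, and other policies, such as picking the cell with the most free memory, are equally defensible.

```python
def pick_cell(free_kb, required_kb):
    """Return the id of the cell with the least free memory that still
    fits the guest (best fit), or None if no cell is large enough."""
    candidates = [(free, cell) for cell, free in free_kb.items()
                  if free >= required_kb]
    if not candidates:
        return None
    return min(candidates)[1]


# Values taken from the virsh freecell output above, in kB.
free_kb = {0: 2203620, 1: 3354784}
required_kb = 3 * 1024 * 1024  # a 3 GB guest

print(pick_cell(free_kb, required_kb))  # 1
```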

Locking the guest to a NUMA node or physical CPU set

We have now decided to run the guest on NUMA node 1, and referring back to the capabilities data about NUMA topology, we see this node has physical CPUs 4-7. When creating the guest XML we can now specify this as the CPU mask for the guest. Where the guest virtual CPU count is specified

<vcpu>4</vcpu>

we can now add the mask

<vcpu cpuset='4-7'>4</vcpu>

As mentioned earlier, this works for QEMU, KVM and Xen guests. In the QEMU/KVM case, libvirt will use the sched_setaffinity call at guest startup, while in the Xen case libvirt will instruct XenD to make an equivalent hypercall.
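If the guest XML is being generated by a script, the mask can be added mechanically. This is a minimal sketch using the Python standard library; the helper names format_cpuset and set_cpuset and the tiny sample domain document are hypothetical, and it relies on the element being named vcpu, as it is in libvirt's domain XML schema.

```python
import xml.etree.ElementTree as ET


def format_cpuset(cpus):
    """Collapse CPU ids into cpuset syntax, e.g. [4, 5, 6, 7] -> '4-7'."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current run
        else:
            ranges.append([cpu, cpu])    # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def set_cpuset(domain_xml, cpus):
    """Return domain XML with the cpuset attribute set on <vcpu>."""
    root = ET.fromstring(domain_xml)
    vcpu = root.find("vcpu")
    if vcpu is None:
        raise ValueError("domain XML has no <vcpu> element")
    vcpu.set("cpuset", format_cpuset(cpus))
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed guest definition.
DOMAIN_XML = "<domain type='kvm'><name>demo</name><vcpu>4</vcpu></domain>"

print(set_cpuset(DOMAIN_XML, [4, 5, 6, 7]))
```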

Automatic placement using virt-install

This walkthrough illustrated the concepts in terms of virsh commands. If writing a management application using libvirt, you would of course use the equivalent APIs for looking up this data: virNodeGetInfo, virConnectGetCapabilities and virNodeGetCellsFreeMemory. The virt-install provisioning tool does exactly this and provides a simple way to automatically apply a ‘best fit’ NUMA policy when installing guests. Quoting its manual page

   --cpuset=CPUSET

   Set which physical cpus the guest can use. "CPUSET" is a comma separated
   list of numbers, which can also be specified in ranges. Example:

     0,2,3,5     : Use processors 0,2,3 and 5
     1-3,5,6-8   : Use processors 1,2,3,5,6,7 and 8

   If the value ’auto’ is passed, virt-install attempts to automatically
   determine an optimal cpu pinning using NUMA data, if available.

So if you have a NUMA machine and use virt-install, simply always add --cpuset=auto whenever provisioning a new guest.
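To make the CPUSET syntax concrete, here is a minimal sketch of a parser for it. It only handles the two forms quoted from the manual page (single numbers and ranges); the function name parse_cpuset is hypothetical and this is not how virt-install itself implements the option.

```python
def parse_cpuset(spec):
    """Expand a cpuset string such as '1-3,5,6-8' into a sorted CPU list."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


print(parse_cpuset("0,2,3,5"))    # [0, 2, 3, 5]
print(parse_cpuset("1-3,5,6-8"))  # [1, 2, 3, 5, 6, 7, 8]
```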

Fine tuning CPU affinity at runtime

The scheme outlined above is focused on the initial guest placement at boot time. There may be times when it becomes necessary to fine-tune the CPU affinity at runtime. libvirt/virsh can cope with this need too, via the vcpuinfo and vcpupin commands. First, the virsh vcpuinfo command gives you the latest data about where each virtual CPU is running. In this example, rhel5xen is a guest on a Fedora KVM host which I used for RHEL5 Xen package maintenance work. It has 4 virtual CPUs and is allowed to run on any host CPU

# virsh vcpuinfo rhel5xen
VCPU:           0
CPU:            3
State:          running
CPU time:       0.5s
CPU Affinity:   yyyyyyyy

VCPU:           1
CPU:            1
State:          running
CPU Affinity:   yyyyyyyy

VCPU:           2
CPU:            1
State:          running
CPU Affinity:   yyyyyyyy

VCPU:           3
CPU:            2
State:          running
CPU Affinity:   yyyyyyyy

Now let’s say I want to lock each of these virtual CPUs to a separate host CPU in the 2nd NUMA node.

# virsh vcpupin rhel5xen 0 4

# virsh vcpupin rhel5xen 1 5

# virsh vcpupin rhel5xen 2 6

# virsh vcpupin rhel5xen 3 7

The vcpuinfo command can be used again to confirm the placement

# virsh vcpuinfo rhel5xen
VCPU:           0
CPU:            4
State:          running
CPU time:       32.2s
CPU Affinity:   ----y---

VCPU:           1
CPU:            5
State:          running
CPU time:       16.9s
CPU Affinity:   -----y--

VCPU:           2
CPU:            6
State:          running
CPU time:       11.9s
CPU Affinity:   ------y-

VCPU:           3
CPU:            7
State:          running
CPU time:       14.6s
CPU Affinity:   -------y
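The affinity column is a per-host-CPU mask, where position N is ‘y’ if the virtual CPU may run on host CPU N. If you want to check the placement from a script, the following is a minimal sketch that turns vcpuinfo text into a mapping of virtual CPU to allowed host CPUs. It assumes the output format shown above; the function names are hypothetical and the sample is the first two entries of that output.

```python
def affinity_to_cpus(mask):
    """Convert a mask such as '----y---' into a list of host CPU ids."""
    return [i for i, flag in enumerate(mask) if flag == "y"]


def parse_vcpuinfo(text):
    """Return {vcpu number: allowed host CPUs} from virsh vcpuinfo output."""
    vcpus = {}
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        vcpus[int(fields["VCPU"])] = affinity_to_cpus(fields["CPU Affinity"])
    return vcpus


SAMPLE = """\
VCPU:           0
CPU:            4
State:          running
CPU time:       32.2s
CPU Affinity:   ----y---

VCPU:           1
CPU:            5
State:          running
CPU time:       16.9s
CPU Affinity:   -----y--
"""

print(parse_vcpuinfo(SAMPLE))  # {0: [4], 1: [5]}
```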

And just to prove I’m not faking it all, here’s the KVM process running on the host and its /proc status

# grep pid /var/run/libvirt/qemu/rhel5xen.xml
<domstatus state='running' pid='4907'>

# grep Cpus_allowed_list /proc/4907/task/*/status
/proc/4907/task/4916/status:Cpus_allowed_list: 4
/proc/4907/task/4917/status:Cpus_allowed_list: 5
/proc/4907/task/4918/status:Cpus_allowed_list: 6
/proc/4907/task/4919/status:Cpus_allowed_list: 7
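The same check as the grep above can be sketched in Python, reading the Cpus_allowed_list field from each thread’s status file. This assumes a Linux /proc filesystem; the function names and the proc_root parameter (there only so the code can be pointed at a different directory) are hypothetical.

```python
import glob
import os


def allowed_list_from_status(status_text):
    """Return the Cpus_allowed_list value from a /proc status file, or None."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return line.split(":", 1)[1].strip()
    return None


def thread_affinities(pid, proc_root="/proc"):
    """Return {thread id: Cpus_allowed_list string} for every thread of pid."""
    result = {}
    pattern = os.path.join(proc_root, str(pid), "task", "*", "status")
    for path in glob.glob(pattern):
        tid = int(os.path.basename(os.path.dirname(path)))
        try:
            with open(path) as status_file:
                result[tid] = allowed_list_from_status(status_file.read())
        except OSError:
            continue  # the thread exited while we were reading
    return result


# Inspect the current process; pass the QEMU pid instead to check a guest.
print(thread_affinities(os.getpid()))
```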

Future work

The approach outlined above relies on the fact that the kernel will always try to allocate memory from the NUMA node that matches the one the guest CPUs are executing on. While this is sufficient in the simple case, there are some pitfalls along the way. Between the time the guest is started & memory is allocated, RAM from the NUMA node in question may have been used up, causing the OS to fall back to allocating from another node. For this reason, if placing guests on NUMA nodes, it is crucial that all guests running on the host have fixed placement, with none allowed to float free. In some weird and wonderful NUMA topologies (hello Itanium!) there can be NUMA nodes which have only CPUs, and/or only RAM. To cope with these it will be necessary to extend libvirt to allow an explicit memory allocation node to be listed in the guest configuration.