Controlling guest CPU & NUMA affinity in libvirt with QEMU, KVM & Xen

Posted: February 12th, 2010 | Filed under: libvirt, Virt Tools

When provisioning new guests with libvirt, the standard policy for affinity between the guest and host CPUs / NUMA nodes is to have no policy at all. In other words, the guest will follow whatever the hypervisor’s own default policy is, which is usually to run the guest on whatever host CPU is available. There are times when an explicit policy may be better. In particular, to make the most of a NUMA architecture it is usually desirable to lock a guest to a particular NUMA node, so that its memory allocations are always local to the node it is running on, avoiding cross-node memory transfers, which have less bandwidth. As of writing, libvirt supports this capability for QEMU, KVM and Xen guests. Even on a non-NUMA system, some form of explicit placement across the host’s sockets, cores & hyperthreads may be desired.

Querying host CPU / NUMA topology

The first step in deciding what policy to apply is to figure out what the host’s topology is. The virsh nodeinfo command provides information about how many sockets, cores & hyperthreads there are on a host.

# virsh nodeinfo
CPU model:           x86_64
CPU(s):              8
CPU frequency:       1000 MHz
CPU socket(s):       2
Core(s) per socket:  4
Thread(s) per core:  1
NUMA cell(s):        1
Memory size:         8179176 kB

There are a total of 8 CPUs, in 2 sockets, each with 4 cores.
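As a rough illustration of reading this output from a script, here is a minimal Python sketch. The `parse_nodeinfo` helper and the `NODEINFO` sample string are hypothetical names introduced here; the sample is simply the output shown above pasted in, and nothing here calls libvirt.

```python
# Sample text copied from the `virsh nodeinfo` output above.
NODEINFO = """\
CPU model:           x86_64
CPU(s):              8
CPU frequency:       1000 MHz
CPU socket(s):       2
Core(s) per socket:  4
Thread(s) per core:  1
NUMA cell(s):        1
Memory size:         8179176 kB
"""


def parse_nodeinfo(text):
    """Turn the 'key: value' lines into a dict of strings (hypothetical helper)."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


info = parse_nodeinfo(NODEINFO)

# Sockets x cores x threads should match the reported CPU count.
logical_cpus = (
    int(info["CPU socket(s)"])
    * int(info["Core(s) per socket"])
    * int(info["Thread(s) per core"])
)
print(logical_cpus)  # 8
```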

More interesting though is the NUMA topology. This can be significantly more complex, so the data is provided in a structured XML document, as part of the virsh capabilities output:

# virsh capabilities
<capabilities>

  <host>
    <cpu>
      <arch>x86_64</arch>
    </cpu>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <cpus num='4'>
            <cpu id='0'/>
            <cpu id='1'/>
            <cpu id='2'/>
            <cpu id='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <cpus num='4'>
            <cpu id='4'/>
            <cpu id='5'/>
            <cpu id='6'/>
            <cpu id='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    <secmodel>
      <model>selinux</model>
      <doi>0</doi>
    </secmodel>
  </host>

 ...removed remaining XML...

</capabilities>

This tells us that there are two NUMA nodes (aka cells), each containing 4 logical CPUs. Since we know there are two sockets, we can infer from this that each socket is in a separate node, not that this really matters for what we need later. If we’re intending to run a guest with 4 virtual CPUs, we can see that it will be desirable to lock the guest to physical CPUs 0-3, or 4-7, to avoid non-local memory accesses. If our guest workload required 8 virtual CPUs, since each NUMA node only has 4 physical CPUs, better utilization may be obtained by running a pair of 4 CPU guests & splitting the work between them, rather than using a single 8 CPU guest.
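The cell-to-CPU mapping can be pulled out of the capabilities XML with a few lines of Python. This is only a sketch: `cell_cpus` is a hypothetical helper, and `CAPS_XML` is a trimmed copy of the topology block shown above rather than live output.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <topology> section from the capabilities XML above.
CAPS_XML = """\
<capabilities>
  <host>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <cpus num='4'>
            <cpu id='0'/><cpu id='1'/><cpu id='2'/><cpu id='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <cpus num='4'>
            <cpu id='4'/><cpu id='5'/><cpu id='6'/><cpu id='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
  </host>
</capabilities>
"""


def cell_cpus(caps_xml):
    """Map each NUMA cell id to the list of host CPU ids it contains."""
    root = ET.fromstring(caps_xml)
    cells = {}
    for cell in root.findall("./host/topology/cells/cell"):
        cpu_ids = [int(cpu.get("id")) for cpu in cell.findall("./cpus/cpu")]
        cells[int(cell.get("id"))] = cpu_ids
    return cells


topology = cell_cpus(CAPS_XML)
print(topology)  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```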

Deciding which NUMA node to run the guest on

Locking a guest to a particular NUMA node is rather pointless if that node does not have sufficient free memory to satisfy the guest’s allocations locally. Indeed, it would be very detrimental to utilization. The next step is to ask libvirt what the free memory is on each node, using the virsh freecell command.

# virsh freecell 0
0: 2203620 kB

# virsh freecell 1
1: 3354784 kB

If our guest needs to have 3 GB of RAM allocated, then clearly it needs to be run on NUMA node (cell) 1, rather than node 0, since the latter only has 2.2 GB available.
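The same decision can be sketched in Python. The selection rule used here (pick the node with the least free memory that still fits the guest) is an assumption made for illustration, not something libvirt does for you, and `parse_freecell` and `pick_cell` are hypothetical helper names.

```python
# Lines copied from the `virsh freecell` output above.
FREECELL_OUTPUT = ["0: 2203620 kB", "1: 3354784 kB"]


def parse_freecell(line):
    """Split '<cell>: <amount> kB' into (cell id, free kB)."""
    cell, _, rest = line.partition(":")
    return int(cell), int(rest.split()[0])


def pick_cell(free_kb, needed_kb):
    """Return the tightest-fitting cell with enough free memory, or None.

    Best-fit is an assumed policy for this sketch.
    """
    candidates = [(free, cell) for cell, free in free_kb.items() if free >= needed_kb]
    return min(candidates)[1] if candidates else None


free_kb = dict(parse_freecell(line) for line in FREECELL_OUTPUT)
needed_kb = 3 * 1024 * 1024  # a 3 GB guest, counting 1 GB as 1024 * 1024 kB

print(pick_cell(free_kb, needed_kb))  # 1
```

If no node has enough free memory the helper returns None, which is the signal that pinning the guest to a single node would not keep its memory local.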

Locking the guest to a NUMA node or physical CPU set

We have now decided to run the guest on NUMA node 1, and referring back to the capabilities data about NUMA topology, we see this node has physical CPUs 4-7. When creating the guest XML we can now specify this as the CPU mask for the guest. Where the guest virtual CPU count is specified

<vcpu>4</vcpu>

we can now add the mask

<vcpu cpuset='4-7'>4</vcpu>

As mentioned earlier, this works for QEMU, KVM and Xen guests. In the QEMU/KVM case, libvirt will use the sched_setaffinity call at guest startup, while in the Xen case libvirt will instruct XenD to make an equivalent hypercall.
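For anyone generating guest XML from a script, the mask can be added programmatically. The following is a minimal sketch under a few assumptions: the vCPU count lives in a `vcpu` element directly under `domain` (as in libvirt's domain XML format), `DOMAIN_XML` is a made-up skeleton rather than a complete guest definition, and `format_cpuset` and `pin_domain` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed guest definition used only for this example.
DOMAIN_XML = """\
<domain type='kvm'>
  <name>demo</name>
  <vcpu>4</vcpu>
</domain>
"""


def format_cpuset(cpus):
    """Collapse CPU ids into range syntax, e.g. [4, 5, 6, 7] -> '4-7'."""
    cpus = sorted(set(cpus))
    parts = []
    i = 0
    while i < len(cpus):
        start = end = cpus[i]
        # Extend the current run while the next id is consecutive.
        while i + 1 < len(cpus) and cpus[i + 1] == end + 1:
            i += 1
            end = cpus[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def pin_domain(domain_xml, cpus):
    """Return the domain XML with a cpuset attribute set on <vcpu>."""
    root = ET.fromstring(domain_xml)
    vcpu = root.find("vcpu")
    if vcpu is None:
        raise ValueError("domain XML has no <vcpu> element")
    vcpu.set("cpuset", format_cpuset(cpus))
    return ET.tostring(root, encoding="unicode")


pinned = pin_domain(DOMAIN_XML, [4, 5, 6, 7])
print(pinned)
```

Note that ElementTree serializes attributes with double quotes, so the result contains `cpuset="4-7"` rather than the single-quoted form shown above; both are equivalent XML.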

Automatic placement using virt-install

This walkthrough illustrated the concepts in terms of virsh commands. If writing a management application using libvirt, you would of course use the equivalent APIs for looking up this data, virNodeGetInfo, virConnectGetCapabilities and virNodeGetCellsFreeMemory. The virt-install provisioning tool has done exactly this and provides a simple way to automatically apply a ‘best fit’ NUMA policy when installing guests. Quoting its manual page

   --cpuset=CPUSET

   Set which physical cpus the guest can use. "CPUSET" is a comma separated
   list of numbers, which can also be specified in ranges. Example:

     0,2,3,5     : Use processors 0,2,3 and 5
     1-3,5,6-8   : Use processors 1,2,3,5,6,7 and 8

   If the value ’auto’ is passed, virt-install attempts to automatically
   determine an optimal cpu pinning using NUMA data, if available.

So if you have a NUMA machine and use virt-install, simply always add --cpuset=auto whenever provisioning a new guest.
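To make the CPUSET syntax concrete, here is a small Python sketch that expands such a string into the list of processors it names. It covers only the comma and range forms quoted from the manual page above (not the 'auto' value), and `parse_cpuset` is a hypothetical helper, not part of virt-install.

```python
def parse_cpuset(spec):
    """Expand a string like '1-3,5,6-8' into a sorted list of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


print(parse_cpuset("0,2,3,5"))    # [0, 2, 3, 5]
print(parse_cpuset("1-3,5,6-8"))  # [1, 2, 3, 5, 6, 7, 8]
```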

Fine tuning CPU affinity at runtime

The scheme outlined above is focused on the initial guest placement at boot time. There may be times where it becomes necessary to fine-tune the CPU affinity at runtime. libvirt/virsh can cope with this need too, via the vcpuinfo and vcpupin commands. First, the virsh vcpuinfo command gives you the latest data about where each virtual CPU is running. In this example, rhel5xen is a guest on a Fedora KVM host which I used for RHEL5 Xen package maintenance work. It has 4 virtual CPUs and is being allowed to run on any host CPU

# virsh vcpuinfo rhel5xen
VCPU:           0
CPU:            3
State:          running
CPU time:       0.5s
CPU Affinity:   yyyyyyyy

VCPU:           1
CPU:            1
State:          running
CPU Affinity:   yyyyyyyy

VCPU:           2
CPU:            1
State:          running
CPU Affinity:   yyyyyyyy

VCPU:           3
CPU:            2
State:          running
CPU Affinity:   yyyyyyyy

Now let’s say I want to lock each of these virtual CPUs to a separate host CPU in the 2nd NUMA node.

# virsh vcpupin rhel5xen 0 4

# virsh vcpupin rhel5xen 1 5

# virsh vcpupin rhel5xen 2 6

# virsh vcpupin rhel5xen 3 7
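With more vCPUs, typing these out by hand gets tedious. A minimal Python sketch that builds the same command lines is shown here; `vcpupin_commands` is a hypothetical helper that only constructs the argument lists and does not run anything.

```python
def vcpupin_commands(domain, host_cpus):
    """Pair vCPU 0..n-1 with the given host CPUs, one virsh argv per vCPU."""
    return [
        ["virsh", "vcpupin", domain, str(vcpu), str(cpu)]
        for vcpu, cpu in enumerate(host_cpus)
    ]


commands = vcpupin_commands("rhel5xen", [4, 5, 6, 7])
for argv in commands:
    print(" ".join(argv))
```

Each list could then be handed to `subprocess.run(argv, check=True)` on a host where virsh is installed.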

The vcpuinfo command can be used again to confirm the placement

# virsh vcpuinfo rhel5xen
VCPU:           0
CPU:            4
State:          running
CPU time:       32.2s
CPU Affinity:   ----y---

VCPU:           1
CPU:            5
State:          running
CPU time:       16.9s
CPU Affinity:   -----y--

VCPU:           2
CPU:            6
State:          running
CPU time:       11.9s
CPU Affinity:   ------y-

VCPU:           3
CPU:            7
State:          running
CPU time:       14.6s
CPU Affinity:   -------y
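The affinity column is easy to check mechanically. This Python sketch parses the output above into a vCPU to host CPU mapping and verifies that no two vCPUs share a host CPU; `parse_vcpuinfo` is a hypothetical helper and `VCPUINFO` is the output shown above pasted in as a string.

```python
# Sample text copied from the second `virsh vcpuinfo` output above.
VCPUINFO = """\
VCPU:           0
CPU:            4
State:          running
CPU time:       32.2s
CPU Affinity:   ----y---

VCPU:           1
CPU:            5
State:          running
CPU time:       16.9s
CPU Affinity:   -----y--

VCPU:           2
CPU:            6
State:          running
CPU time:       11.9s
CPU Affinity:   ------y-

VCPU:           3
CPU:            7
State:          running
CPU time:       14.6s
CPU Affinity:   -------y
"""


def parse_vcpuinfo(text):
    """Map each vCPU number to the host CPUs marked 'y' in its affinity mask."""
    affinity = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "VCPU":
            current = int(value)
        elif key == "CPU Affinity" and current is not None:
            affinity[current] = [i for i, flag in enumerate(value) if flag == "y"]
    return affinity


affinity = parse_vcpuinfo(VCPUINFO)
print(affinity)  # {0: [4], 1: [5], 2: [6], 3: [7]}

# Every vCPU should be pinned to exactly one host CPU, with no sharing.
pinned_cpus = [cpus[0] for cpus in affinity.values() if len(cpus) == 1]
print(len(set(pinned_cpus)) == len(affinity))  # True
```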

And just to prove I’m not faking it all, here’s the KVM process running on the host and its /proc status

# grep pid /var/run/libvirt/qemu/rhel5xen.xml
<domstatus state='running' pid='4907'>

# grep Cpus_allowed_list /proc/4907/task/*/status
/proc/4907/task/4916/status:Cpus_allowed_list: 4
/proc/4907/task/4917/status:Cpus_allowed_list: 5
/proc/4907/task/4918/status:Cpus_allowed_list: 6
/proc/4907/task/4919/status:Cpus_allowed_list: 7

Future work

The approach outlined above relies on the fact that the kernel will always try to allocate memory from the NUMA node that matches the one the guest CPUs are executing on. While this is sufficient in the simple case, there are some pitfalls along the way. Between the time the guest is started & memory is allocated, RAM from the NUMA node in question may have been used up, causing the OS to fall back to allocating from another node. For this reason, if placing guests on NUMA nodes, it is crucial that all guests running on the host have fixed placement, with none allowed to float free. In some weird and wonderful NUMA topologies (hello Itanium!) there can be NUMA nodes which have only CPUs, and/or only RAM. To cope with these it will be necessary to extend libvirt to allow an explicit memory allocation node to be listed in the guest configuration.

14 Comments

Alan said at 2:42 pm on February 12th, 2010:

Nice post, just one correction in verification, if you're on RHEL5 kernel grep Cpus_allowed – there isn't Cpus_allowed_list in tasks/*/status

Cole said at 3:05 pm on February 12th, 2010:

FYI, virt-manager as of 0.8.0 can do CPU pinning: persistent via 'cpuset' and individual pinning at runtime with the equivalent vcpupin APIs.

caela said at 2:06 pm on April 29th, 2011:

Do you happen to know if there is an equivalent to
xm vcpu-list
when using KVM? I kind of miss it to get a better overview.

Daniel Berrange said at 2:09 pm on April 29th, 2011:

I can’t remember exactly what ‘xm vcpu-list’ does, but perhaps ‘virsh vcpuinfo $DOMNAME’ is close. Failing that, I’d recommend ‘virt-top’ as a good way to monitor guest performance.

caela said at 5:16 pm on April 29th, 2011:

virsh vcpuinfo gives only information for one domU, I would like to see this information for all domUs in my dom0 all in one go, so I don’t by accident pin the vcpus of two domUs to the same physical core when intending to bind them to a single instance.

Jakob Bohm said at 2:17 pm on July 4th, 2011:

Note: The common amd64 (Opteron x86_64) architecture allows NUMA nodes with only CPUs, but not NUMA nodes with no CPUs (because the memory controllers are part of the CPUs). For example if I put RAM blocks only in one side of our dual-quadcore AMD64 server, there will be one NUMA node with 4 CPUs and all the memory and another with 4 CPUs and no memory. Once I add RAM blocks in both sides of the server, there will be RAM (not necessarily the same amount) on both NUMA nodes. With more than 2 CPU sockets, the possibilities become even more interesting.

I am unsure of the NUMA possibilities with Intel x86_64 chips. The older Xeon x86_64 CPUs had no on-chip memory controller and may thus be paired with a chipset that allows memory-only NUMA nodes (empty CPU socket, populated RAM sockets, chipset routes memory access). More recent Intel x86_64 CPUs have onboard memory controllers, making them more like Opterons in principle.

Daniel Berrange said at 11:20 am on July 15th, 2011:

@jakob yes the possibility to have NUMA nodes without any memory is something not well handled by the NUMA functionality I describe above. Since we’re only doing CPU pinning, we’re relying on the kernel to allocate guest memory on the same node. This obviously doesn’t work if there is no memory on a node. For this reason, amongst others, the latest libvirt now also lets you set explicit NUMA memory pinning rules for a guest.

IIUC, the ia64 architecture also allows you to have memory nodes, without any CPUs. So pretty much anything is possible…

Andrew Kinney said at 10:55 pm on October 5th, 2011:

What I’ve seen with testing on our own machines (Supermicro quad socket Opteron 61xx) is that CPU nodes without RAM attach themselves to cell ID 0.

Dave said at 12:02 am on October 7th, 2011:

I have a quad-core Intel Xeon E31240 with one guest, and I’d like it to use the majority of the available cores of the same type as the host.

How do I configure the cpu type on the guest xml config so it reflects all the features that are available?

Currently it seems not all the cpu flags are represented in /proc/cpuinfo on the guest, so I was confused how to configure this properly.

thanks

Daniel Berrange said at 4:18 pm on October 8th, 2011:

@Dave you basically need to run ‘virsh capabilities’ and copy the <cpu> block from that into the guest XML. This will make your guest use a CPU model that is as close as possible to the host CPU model. NB, KVM itself may filter out a few CPU flags that are not possible to support, but you should see everything important in the guest.

Adit Ranadive said at 7:29 pm on January 13th, 2012:

@Daniel Do you know if KVM/libvirt now allow capping the percentage of cpu a VM uses? Something equivalent to the ‘xm sched-cred --cap=CAP’ in Xen?
If so which version of libvirt is it supported with?

Thanks,
Adit

Daniel Berrange said at 7:40 pm on January 13th, 2012:

The virsh ‘schedinfo’ command allows setting scheduler tunables. With a *very* new kernel there should be the ability to set hard caps, but I can’t remember exact versions.

Laurent Pouilloux said at 1:25 pm on April 30th, 2013:

Hi,

have you tried to use the VM CPU placement in the case of a 24 CPU machine?

It seems there is a bug on the counting of physical CPU (util.get_phy_cpu). I have reported a bug:
https://bugzilla.redhat.com/show_bug.cgi?id=953187

and it also occurs with version 0.600 of virtinst.

Best regards and thank you for the great job !

Siddharth Singh said at 6:38 am on April 21st, 2016:

Nice post. Works on RHEL 7 too where runtime pinning through GUI is unavailable. Thanks!
