Guest CPU model configuration in libvirt with QEMU/KVM

Posted: February 15th, 2010 | Filed under: libvirt, Virt Tools | 8 Comments »

Many of the management problems in virtualization are caused by the annoyingly popular & desirable host migration feature! I previously talked about PCI device addressing problems, but this time the topic to consider is that of CPU models. Every hypervisor has its own default policy for what a guest will see for its CPUs: Xen just passes through the host CPU, while with QEMU/KVM the guest sees a generic model called “qemu32” or “qemu64”. VMWare does something more advanced, classifying all physical CPUs into a handful of groups and exposing one baseline CPU model for each group to the guest. VMWare’s behaviour lets guests safely migrate between hosts, provided they all have physical CPUs that classify into the same group. libvirt does not like to enforce policy itself, preferring just to provide the mechanism on which the higher layers can define their own desired policy. CPU models are a complex subject, so it has taken longer than desirable to support their configuration in libvirt. In the 0.7.5 release that will be in Fedora 13, there is finally a comprehensive mechanism for controlling guest CPUs.

Learning about the host CPU model

If you have been following earlier articles (or otherwise know a bit about libvirt) you’ll know that the “virsh capabilities” command displays an XML document describing the capabilities of the hypervisor connection & host. It should thus come as no surprise that this XML schema has been extended to provide information about the host CPU model. One of the big challenges in describing CPU models is that every architecture has a different approach to exposing its capabilities. On x86, a modern CPU’s capabilities are exposed via the CPUID instruction. Essentially this comes down to a set of 32-bit integers with each bit given a specific meaning. Fortunately AMD & Intel agree on common semantics for these bits. VMWare and Xen both expose the notion of CPUID masks directly in their guest configuration format. Unfortunately (or fortunately depending on your POV) QEMU/KVM supports far more than just the x86 architecture, so CPUID is clearly not suitable as the canonical configuration format. QEMU ended up using a scheme which combines a CPU model name string with a set of named flags. On x86 the CPU model maps to a baseline CPUID mask, and the flags can be used to then toggle bits in the mask on or off. libvirt decided to follow this lead and use a combination of a model name and flags. Without further ado, here is an example of what libvirt reports as the capabilities of my laptop’s CPU

# virsh capabilities
<capabilities>

  <host>
    <cpu>
      <arch>i686</arch>
      <model>pentium3</model>
      <topology sockets='1' cores='2' threads='1'/>
      <feature name='lahf_lm'/>
      <feature name='lm'/>
      <feature name='xtpr'/>
      <feature name='cx16'/>
      <feature name='ssse3'/>
      <feature name='tm2'/>
      <feature name='est'/>
      <feature name='vmx'/>
      <feature name='ds_cpl'/>
      <feature name='monitor'/>
      <feature name='pni'/>
      <feature name='pbe'/>
      <feature name='tm'/>
      <feature name='ht'/>
      <feature name='ss'/>
      <feature name='sse2'/>
      <feature name='acpi'/>
      <feature name='ds'/>
      <feature name='clflush'/>
      <feature name='apic'/>
    </cpu>
    ...snip...

It is not practical to have a database listing all known CPU models, so libvirt has a small list of baseline CPU model names. It picks the one that shares the greatest number of CPUID bits with the actual host CPU and then lists the remaining bits as named features. Notice that libvirt does not tell you what features the baseline CPU contains. This might seem like a flaw at first, but as will be shown next, it is not actually necessary to know this information.
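
For management applications the same information is available via the virConnectGetCapabilities API. Here is a rough sketch using the libvirt Python bindings that pulls out the host CPU model name and feature list (treat it as an illustration rather than production code):

import xml.etree.ElementTree as ET
import libvirt

# Connect to the local QEMU/KVM hypervisor and fetch the capabilities XML
conn = libvirt.open("qemu:///system")
caps = ET.fromstring(conn.getCapabilities())

# The <host><cpu> element carries the baseline model plus the extra features
cpu = caps.find("host/cpu")
print("Model:   ", cpu.findtext("model"))
print("Features:", [f.get("name") for f in cpu.findall("feature")])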

Determining a compatible CPU model to suit a pool of hosts

Now that it is possible to find out what CPU capabilities a single host has, the next problem is to determine what CPU capabilities are best to expose to the guest. If it is known that the guest will never need to be migrated to another host, the host CPU model can be passed straight through unmodified. Some lucky people might have a virtualized data center where they can guarantee all servers will have 100% identical CPUs. Again the host CPU model can be passed straight through unmodified. The interesting case though, is where there is variation in CPUs between hosts. In this case the lowest common denominator CPU must be determined. This is not entirely straightforward, so libvirt provides an API for exactly this task. Provide libvirt with a list of XML documents, each describing a host’s CPU model, and it will internally convert these to CPUID masks, calculate their intersection, and finally convert the resulting CPUID mask back into an XML CPU description. Taking the CPU description from a random server

<capabilities>
  <host>
    <cpu>
      <arch>x86_64</arch>
      <model>phenom</model>
      <topology sockets='2' cores='4' threads='1'/>
      <feature name='osvw'/>
      <feature name='3dnowprefetch'/>
      <feature name='misalignsse'/>
      <feature name='sse4a'/>
      <feature name='abm'/>
      <feature name='cr8legacy'/>
      <feature name='extapic'/>
      <feature name='cmp_legacy'/>
      <feature name='lahf_lm'/>
      <feature name='rdtscp'/>
      <feature name='pdpe1gb'/>
      <feature name='popcnt'/>
      <feature name='cx16'/>
      <feature name='ht'/>
      <feature name='vme'/>
    </cpu>
    ...snip...

As a quick check, it is possible to ask libvirt whether this CPU description is compatible with the earlier laptop CPU description, using the “virsh cpu-compare” command

$ ./tools/virsh cpu-compare cpu-server.xml
CPU described in cpu-server.xml is incompatible with host CPU

libvirt is correctly reporting that the CPUs are incompatible, because there are several features in the laptop CPU that are missing in the server CPU. To be able to migrate between the laptop and the server, it will be necessary to mask out some features, but which ones? Again libvirt provides an API for this, also exposed via the “virsh cpu-baseline” command

# virsh cpu-baseline both-cpus.xml
<cpu match='exact'>
  <model>pentium3</model>
  <feature policy='require' name='lahf_lm'/>
  <feature policy='require' name='lm'/>
  <feature policy='require' name='cx16'/>
  <feature policy='require' name='monitor'/>
  <feature policy='require' name='pni'/>
  <feature policy='require' name='ht'/>
  <feature policy='require' name='sse2'/>
  <feature policy='require' name='clflush'/>
  <feature policy='require' name='apic'/>
</cpu>

libvirt has determined that in order to safely migrate a guest between the laptop and the server, it is necessary to mask out 11 features from the laptop’s XML description.
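
Both checks are also available as APIs (virConnectCompareCPU, and virConnectBaselineCPU from 0.7.7 onwards), so a management application does not need to shell out to virsh. Here is a rough sketch using the Python bindings; cpu-laptop.xml is just an assumed file holding the laptop’s <cpu> description:

import libvirt

conn = libvirt.open("qemu:///system")

# CPU description of the other host, e.g. saved from its "virsh capabilities"
server_cpu = open("cpu-server.xml").read()

# virConnectCompareCPU: is that CPU compatible with this host's CPU ?
if conn.compareCPU(server_cpu, 0) == libvirt.VIR_CPU_COMPARE_INCOMPATIBLE:
    print("CPU described in cpu-server.xml is incompatible with host CPU")

# virConnectBaselineCPU: compute the lowest common denominator of several CPUs
laptop_cpu = open("cpu-laptop.xml").read()
print(conn.baselineCPU([laptop_cpu, server_cpu], 0))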

Configuring the guest CPU model

To simplify life, the guest CPU configuration accepts the same basic XML representation as the host capabilities XML exposes. In other words, the XML from that “cpu-baseline” virsh command can now be copied directly into the guest XML at the top level under the <domain> element. As the observant reader will have noticed from that last XML snippet, there are a few extra attributes available when describing a CPU in the guest XML. These can mostly be ignored, but for the curious here’s a quick description of what they do. The top level <cpu> element gets an attribute called “match” with possible values

  • match=”minimum” – the host CPU must have at least the CPU features described in the guest XML. If the host has additional features beyond the guest configuration, these will also be exposed to the guest
  • match=”exact” – the host CPU must have at least the CPU features described in the guest XML. If the host has additional features beyond the guest configuration, these will be masked out from the guest
  • match=”strict” – the host CPU must have exactly the same CPU features described in the guest XML. No more, no less.

The next enhancement is that the <feature> elements can each have an extra “policy” attribute with possible values

  • policy=”force” – expose the feature to the guest even if the host does not have it. This is kind of crazy, except in the case of software emulation.
  • policy=”require” – expose the feature to the guest and fail if the host does not have it. This is the sensible default.
  • policy=”optional” – expose the feature to the guest if the host happens to support it.
  • policy=”disable” – if the host has this feature, then hide it from the guest
  • policy=”forbid” – if the host has this feature, then fail and refuse to start the guest

The “forbid” policy is for a niche scenario where a badly behaved application will try to use a feature even if it is not in the CPUID mask, and you wish to prevent accidentally running the guest on a host with that feature. The “optional” policy has special behaviour with respect to migration. When the guest is initially started the feature is optional, but when the guest is live migrated, the policy turns into “require”, since you can’t have features disappearing across migration.
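
Purely for illustration, a hypothetical guest <cpu> element mixing several of these policies might look like this (the particular feature choices are arbitrary):

<cpu match='exact'>
  <model>pentium3</model>
  <feature policy='require' name='sse2'/>
  <feature policy='optional' name='ssse3'/>
  <feature policy='disable' name='acpi'/>
  <feature policy='forbid' name='vmx'/>
</cpu>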

Everything described in this posting is currently implemented for libvirt’s QEMU/KVM driver: the basic code is in the 0.7.5/0.7.6 releases, while the final ‘cpu-baseline’ support arrives in 0.7.7. Needless to say this will all be available in Fedora 13 and future RHEL. This also needs to be ported over to the Xen and VMWare ESX drivers in libvirt, which isn’t as hard as it sounds, because libvirt now has a very good internal API for processing CPUID masks. Kudos to Jiri Denemark for doing all the really hard work on this CPU modelling system!

Stable guest machine ABI, PCI addressing and disk controllers in libvirt

Posted: February 15th, 2010 | Filed under: libvirt, Virt Tools | 2 Comments »

If you are using a physical machine, the chances are that the only things you hotplug are USB devices, or more occasionally SCSI disk drives, never PCI devices themselves. Although you might flash upgrade the BIOS, doing so is not all that likely to change the machine ABI seen by the operating system. The world of virtual machines though, is not quite so static. It is common for administrators to want to hotplug CPUs, memory, USB devices, PCI devices & disk drives. Upgrades of the underlying virtualization software will also bring in new features & capabilities to existing configured devices in the virtual machine, potentially changing the machine ABI seen by the guest OS. Some operating systems (Linux!) cope with these kinds of changes without batting an eyelid. Other operating systems (Windows!) get very upset and decide you have to re-activate the license keys for your operating system, or re-configure your devices.

Stable machine ABIs

The first problem we addressed in libvirt was that of providing a stable machine ABI, so that when you upgrade QEMU/KVM, the guest doesn’t unexpectedly break. QEMU has always had the concept of a “machine type”. In the x86 emulator this lets you switch between a plain old boring ISA-only PC and a newer ISA+PCI enabled PC. The libvirt capabilities XML format exposes the allowed machine types for each guest architecture

<guest>
  <os_type>hvm</os_type>
  <arch name='i686'>
    <wordsize>32</wordsize>
    <emulator>/usr/bin/qemu</emulator>
    <machine>pc</machine>
    <machine>isapc</machine>
    <domain type='qemu'>
    </domain>
  </arch>
</guest>

This tells you that the i686 emulator with path /usr/bin/qemu supports two machine types “pc” and “isapc”. This information then appears in the guest XML

<domain type='kvm'>
  <name>plain</name>
  <uuid>c7a1edbd-edaf-9455-926a-d65c16db1809</uuid>
  <os>
    <type arch='i686' machine='pc'>hvm</type>
    ...snip...
  </os>
  ...snip...
</domain>

To support a stable machine ABI, the QEMU developers introduced the idea of versioned machine types. Instead of just “pc”, there is now “pc-0.10”, “pc-0.11”, etc, with new versions being added for each QEMU release that changes something in the machine type. The original “pc” machine type is declared to be an alias pointing to the latest version. libvirt captures this information in the capabilities XML, by listing all machine types and using the “canonical” attribute to link up the alias

<guest>
 <os_type>hvm</os_type>
 <arch name='i686'>
   <wordsize>32</wordsize>
   <emulator>/usr/bin/qemu</emulator>
   <machine>pc-0.11</machine>
   <machine canonical='pc-0.11'>pc</machine>
   <machine>pc-0.10</machine>
   <machine>isapc</machine>
   <domain type='qemu'>
   </domain>
 </arch>
</guest>

Comparing with the earlier XML snippet, you can see 2 new machine types “pc-0.10” and “pc-0.11”, and can determine that “pc” is an alias for “pc-0.11”. The next clever part is that when you define / create a new guest in libvirt, if you specify a machine type of “pc”, then libvirt will automatically canonicalize this to “pc-0.11” for you. This means an application developer need never worry about this; most of the time they will automatically get a stable, versioned machine ABI for all their guests.
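
If an application does want to see which versioned machine type an alias resolves to, it can read this straight out of the capabilities XML. A rough sketch with the Python bindings:

import xml.etree.ElementTree as ET
import libvirt

conn = libvirt.open("qemu:///system")
caps = ET.fromstring(conn.getCapabilities())

# List each machine type alias and the versioned machine type it points to
for guest in caps.findall("guest"):
    arch = guest.find("arch").get("name")
    for machine in guest.findall("arch/machine"):
        if machine.get("canonical") is not None:
            print(arch, ":", machine.text, "is an alias for", machine.get("canonical"))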

PCI device addressing

The second problem faced is that of ensuring that a device’s PCI address does not randomly change across reboots, or even across host migrations. This problem is primarily caused by the fact that you can hotplug/unplug arbitrary PCI devices at runtime. As an example, a guest might boot with 2 disks and 1 NIC, a total of 3 PCI devices. These get assigned PCI slots 4, 5 and 6 respectively. The admin then unplugs one of the virtio disks. When the guest then reboots, or migrates to another host, QEMU will assign the remaining devices PCI slots 4 & 5. In other words the NIC has unexpectedly moved from slot 6 to slot 5. Migration is unlikely to be successful when this happens! The first roadblock in attempting to solve this problem was that QEMU did not provide any way for a management application to specify a PCI address for devices. As of the QEMU 0.12 release though, this limitation is finally removed with the introduction of a new generic ‘-device’ argument for configuring virtual hardware devices. QEMU only supports a single PCI bus and no bridges, so we can only configure the PCI slot number at this time, but this is sufficient. As part of a giant patch series I switched libvirt over to use this new syntax.

In implementing this, it was necessary to add a way to record the PCI addresses in the libvirt guest XML format. After a little head scratching we settled on adding a generic “address” element to every single device in the libvirt XML. This example shows what it looks like in the context of a NIC definition

    <interface type='network'>
      <mac address='52:54:00:f7:c5:0e'/>
      <source network='default'/>
      <target dev='vnet0'/>
      <model type='e1000'/>
      <alias name='e1000-nic2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </interface>

Requiring every management application to assign PCI addresses to devices was not at all desirable though; they should not need to think about this kind of thing. Thus whenever a new guest is defined in libvirt with the QEMU/KVM driver, the first thing libvirt does is to assign unique PCI addresses to every single device in the XML. This neatly side-steps the minor problem of having to tell apps that PCI slot 1 is reserved for the PIIX3, slot 2 for the VGA controller and slot 3 for the balloon device. So an application defines a guest without addresses and then can query libvirt for the updated XML to discover what addresses were assigned. For older QEMU < 0.12, we can’t support static PCI addressing due to the lack of the “-device” argument, but libvirt will still report on what addresses were actually assigned by QEMU at runtime.
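
So after defining a guest, an application can simply re-read the XML to see which PCI slots were handed out. A rough sketch with the Python bindings, using the “plain” guest from the earlier example:

import xml.etree.ElementTree as ET
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("plain")

# Dump the guest XML and list the PCI address libvirt assigned to each device
devices = ET.fromstring(dom.XMLDesc(0)).find("devices")
for dev in devices:
    addr = dev.find("address")
    if addr is not None and addr.get("type") == "pci":
        print(dev.tag, "in slot", addr.get("slot"))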

At the time of hotplugging a device, the same principles are applied. The application passes in the device XML without any address info and libvirt assigns a PCI slot. The only minor caveat is that if an application then invokes the “define” operation to make the new device persistent, it should take care to copy across the auto-assigned address. This limitation will be addressed in the near future with an improved libvirt hotplug API.
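
For the curious, a rough sketch of that hotplug workflow with the Python bindings (the NIC XML here is purely illustrative):

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("plain")

# Hotplug a NIC without any <address>; libvirt picks a free PCI slot for it
dom.attachDevice("""
<interface type='network'>
  <source network='default'/>
  <model type='e1000'/>
</interface>
""")

# The live XML now shows the auto-assigned PCI address for the new interface;
# copy it into the persistent config before re-defining the guest
print(dom.XMLDesc(0))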

Disk controllers & drive addressing

At around the same time that I was looking at static PCI addressing, Wolfgang Mauerer was proposing a way to fix the SCSI disk hotplug support in libvirt. The problem here was that at boot time, if you listed 4 SCSI disks, you would get 1 SCSI controller with 4 SCSI disks attached. Meanwhile if you tried hotplugging 4 SCSI disks, you’d get 4 SCSI controllers each with 1 disk attached. In other words we were faking SCSI disk hotplug by attaching entirely new SCSI controllers. It doesn’t take a genius to work out that this is going to crash & burn at the next reboot, or migration, when those hotplugged SCSI controllers all disappear and the disks end up back on a single controller. The solution here was to explicitly model the idea of a disk controller in the libvirt guest XML, separate from the disk itself. For this task, we invented a fairly simple new XML syntax that looks like this

    <controller type='scsi' index='0'>
      <alias name='scsi1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0c' function='0x0'/>
    </controller>

Each additional SCSI controller defined gets a new “index” value, and of course has a unique auto-assigned PCI address too. Notice how the “address” element has a type attribute. This bit of forward planning allows libvirt to introduce new address types later. This capability is now used to link disks to controllers, by defining a “drive” address type consisting of a controller ID, bus ID and unit ID. In this example, the SCSI disk “sdz” is linked to controller 3, attached as unit 4.

    <disk type='file' device='disk'>
      <source file='/home/berrange/output.img'/>
      <target dev='sdz' bus='scsi'/>
      <alias name='scsi3-0-4'/>
      <address type='drive' controller='3' bus='0' unit='4'/>
    </disk>

Since no existing libvirt application knows about the “controller” or “address” elements, libvirt will automatically populate these as required. So in this example, if the SCSI disk were specified without any “address” element, libvirt will add one, figuring out the controller & unit properties based on the device name “sdz”. Each SCSI controller can have 7 units attached, and “sdz” is the 26th disk (index 25, counting from zero), hence it ends up on controller 3 (25 / 7) and unit 4 (25 % 7), as sketched below. Once we added controllers and drive addresses for SCSI disks, naturally the same was also done for IDE and floppy disks. Virtio disks are a little special in that there is no separate VirtIO disk controller; every VirtIO disk is a separate PCI device. This is somewhat of a scalability problem, since QEMU only has 31 PCI slots and 3 of those are already taken up by the PIIX3, VGA adapter & balloon driver. Needless to say, we hope that QEMU will implement a PCI bridge soon, or introduce a real VirtIO disk controller.
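
The drive name arithmetic is simple enough to sketch in a few lines (an illustration of the rule just described, not libvirt’s actual code):

# Each SCSI controller holds 7 units; disks are numbered sda=0, sdb=1, ... sdz=25
def scsi_drive_address(target):
    idx = ord(target[-1]) - ord("a")    # only handles sda..sdz, for simplicity
    return idx // 7, idx % 7            # (controller, unit)

print(scsi_drive_address("sdz"))        # (3, 4): controller 3, unit 4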

Adventures in migrating from static website + Blogger SFTP to WordPress

Posted: February 14th, 2010 | Filed under: Uncategorized | 2 Comments »

For many years now my main website has consisted of a set of statically generated webpages providing the overall structure. A couple of areas, notably my main blog, were then dynamically generated using Blogger. The reason I started using Blogger was that it has the ability to publish posts directly to my webserver using SSH/SFTP, thus allowing the dynamic parts of the site to seamlessly integrate with the static parts. Then a couple of weeks ago, Blogger announced that they were discontinuing support for SFTP publishing on March 26th. Needless to say, this rather ruined my website publishing architecture. After thinking about things for a couple of weeks though, I decided this decision of Blogger’s is a blessing in disguise, because the way I managed my website was completely outdated & needed to be brought into the modern world.

What I in fact needed was a very simple content management system that allowed publishing a small number of ‘static’ pages on the site, but with the majority of the content being blog postings. Categorization, tagging & external links would be desirable too. Of course it has to be open source software too, capable of running on both my Debian Lenny webserver & Fedora laptop. As many people are no doubt aware, this is exactly what WordPress provides. As a proof of concept I downloaded the latest WordPress, tried out the install process on my laptop & generally got a feel for its admin capabilities. It all looked perfect, so that was a good decision made.

Exporting content from Blogger

Over the years that I’ve been using Blogger, I’ve written a few hundred postings, many of them worthless trash, but a fair number of them have really useful & frequently visited content. My recent series of articles on libvirt features have been particularly popular. It is absolutely non-negotiable that all existing links to these postings continue to work & don’t all end up broken. So the first step of the migration was to figure out how to export the content from Blogger into WordPress. The first thing I tried was WordPress’ own built-in import tool that can allegedly talk directly to Blogger and pull down all the postings & comments. The first problem I found with this is that it only works if your content was hosted on Blogger, i.e. if you were using SFTP publishing it always reports ‘0 posts’. I temporarily updated my blog settings to turn off SFTP and it at least detected all the posts at that point. I started the import process & it imported 3 posts and 70 comments and then gave up with no indication of what was wrong. I tried again, and the same thing happened. Searching the WordPress forums it seems many people have hit this problem over the past 2 years with no reliable solution yet available.

Then I investigated whether Blogger had its own export capabilities. It does. It can export all your blog posts and comments in a single XML file. Unfortunately there is no apparent standard XML schema for blog import/export, so there didn’t seem to be much use for this export capability & I didn’t fancy writing my own XSL transform to convert it to WordPress’ native XML import schema. The nice thing about Blogger and WordPress being so widely used on the web is that if you have a problem, then the chances are that someone else has had the same problem already. In fact so many people have had this problem that someone’s already written a tool to solve it, Google Blog Converters

I downloaded it, fed it the Blogger exports and it generated some nice looking WordPress XML files. A closer look revealed one tiny flaw – it had unescaped a whole bunch of HTML tags in blog posts where I had been including snippets of example XML or HTML inside <pre> tags. Fortunately the code is all python and it was easy to find the bogus line of code “content = unescape(text)” and replace it with just “content = text”. After that the files imported into WordPress perfectly, preserving all formatting and comments.

Setting up URL redirects

Even though WordPress has a nice friendly URL scheme for articles based on their title, it is very slightly different from the scheme Blogger used for URLs. I was also merging several separate Blogger feeds into one, since WordPress has a nice categorization capability. It was thus inevitable that the URLs for all existing posts would have to change. The solution to this problem was pretty straightforward. Apache’s mod_rewrite engine can be told to load external files containing arbitrary key, value mappings, and then reference these maps in rewrite rules. It was a simple, albeit slightly tedious, process to write a map file that contained the old Blogger URL as the key and the new WordPress URL as the value. As an example, a tiny part of the map I created looks like this

/diary/2008/04/presentation-is-everything /posts/2008/04/21/presentation-is-everything
/diary/2008/06/red-hat-summit-2008 /posts/2008/06/18/red-hat-summit-2008
/diary/2009/12/using-qcow2-disk-encryption-with /posts/2009/12/02/using-qcow2-disk-encryption-with-libvirt-in-fedora-12

Making use of this map just requires two rules in the httpd.conf file: one to load the map and the other to add a match for it. Those rules look like this

RewriteMap blogger txt:/etc/apache2/blogger-rewrite.txt
RewriteRule ^/personal(/diary/.*) ${blogger:$1} [L,R=permanent]

In summary, while the migration process from Blogger to WordPress was not entirely smooth, it went a lot better than I expected it to. Any web user following an old link to a post on my site now gets a permanent redirect to the new location, so no important links were broken during the migration. The new site I have is so much more flexible than the old one & the WordPress UI is very much nicer to use. Blogger’s UI is rather dated & not really on a par with the standard of Google’s other popular apps like GMail.

Controlling guest CPU & NUMA affinity in libvirt with QEMU, KVM & Xen

Posted: February 12th, 2010 | Filed under: libvirt, Virt Tools | 16 Comments »

When provisioning new guests with libvirt, the standard policy for affinity between the guest and host CPUs / NUMA nodes is to have no policy at all. In other words the guest will follow whatever the hypervisor’s own default policy is, which is usually to run the guest on whatever host CPU is available. There are times when an explicit policy may be better. In particular, to make the most of a NUMA architecture it is usually desirable to lock a guest to a particular NUMA node so that its memory allocations are always local to the node it is running on, avoiding cross-node memory transfers which have less bandwidth. At the time of writing, libvirt supports this capability for QEMU, KVM and Xen guests. Even on a non-NUMA system some form of explicit placement across the hosts’ sockets, cores & hyperthreads may be desired.

Querying host CPU / NUMA topology

The first step in deciding what policy to apply is to figure out what the host’s topology is. The virsh nodeinfo command provides information about how many sockets, cores & hyperthreads there are on a host.

# virsh nodeinfo
CPU model:           x86_64
CPU(s):              8
CPU frequency:       1000 MHz
CPU socket(s):       2
Core(s) per socket:  4
Thread(s) per core:  1
NUMA cell(s):        1
Memory size:         8179176 kB

There are a total of 8 CPUs, in 2 sockets, each with 4 cores.
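
The same data is available programmatically from the virNodeGetInfo API. A rough sketch with the Python bindings:

import libvirt

conn = libvirt.open("qemu:///system")

# virNodeGetInfo: model, memory, CPUs, MHz, NUMA cells, sockets, cores, threads
model, memory, cpus, mhz, cells, sockets, cores, threads = conn.getInfo()
print("%d CPUs: %d socket(s) x %d core(s) x %d thread(s)" % (cpus, sockets, cores, threads))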

More interesting though is the NUMA topology. This can be significantly more complex, so the data is provided in a structured XML document, as part of the virsh capabilities output

# virsh capabilities
<capabilities>

  <host>
    <cpu>
      <arch>x86_64</arch>
    </cpu>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <cpus num='4'>
            <cpu id='0'/>
            <cpu id='1'/>
            <cpu id='2'/>
            <cpu id='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <cpus num='4'>
            <cpu id='4'/>
            <cpu id='5'/>
            <cpu id='6'/>
            <cpu id='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    <secmodel>
      <model>selinux</model>
      <doi>0</doi>
    </secmodel>
  </host>

 ...removed remaining XML...

</capabilities>

This tells us that there are two NUMA nodes (aka cells), each containing 4 logical CPUs. Since we know there are two sockets, we can obviously infer from this that each socket is in a separate node, not that this really matters for what we need later. If we’re intending to run a guest with 4 virtual CPUs, we can see that it will be desirable to lock the guest to physical CPUs 0-3 or 4-7, to avoid non-local memory accesses. If our guest workload required 8 virtual CPUs, since each NUMA node only has 4 physical CPUs, better utilization may be obtained by running a pair of 4 CPU guests & splitting the work between them, rather than using a single 8 CPU guest.
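
The cell to CPU mapping can also be pulled out of the capabilities XML programmatically. A rough sketch with the Python bindings:

import xml.etree.ElementTree as ET
import libvirt

conn = libvirt.open("qemu:///system")
caps = ET.fromstring(conn.getCapabilities())

# Build a map of NUMA cell id -> logical CPU ids from the <topology> element
for cell in caps.findall("host/topology/cells/cell"):
    cpus = [cpu.get("id") for cpu in cell.findall("cpus/cpu")]
    print("cell", cell.get("id"), "has CPUs", ",".join(cpus))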

Deciding which NUMA node to run the guest on

Locking a guest to a particular NUMA node is rather pointless if that node does not have sufficient free memory for the guest’s memory allocations. Indeed, it would be very detrimental to utilization. The next step is to ask libvirt what the free memory is on each node, using the virsh freecell command

# virsh freecell 0
0: 2203620 kB

# virsh freecell 1
1: 3354784 kB

If our guest needs to have 3 GB of RAM allocated, then clearly it needs to be run on NUMA node (cell) 1, rather than node 0, since the latter only has 2.2 GB available.
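
The equivalent API is virNodeGetCellsFreeMemory. A rough sketch with the Python bindings, querying both cells in one call:

import libvirt

conn = libvirt.open("qemu:///system")

# Free memory is reported in bytes, one entry per NUMA cell
for cell, free in enumerate(conn.getCellsFreeMemory(0, 2)):
    print("%d: %d kB" % (cell, free // 1024))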

Locking the guest to a NUMA node or physical CPU set

We have now decided to run the guest on NUMA node 1, and referring back to the capabilities data about NUMA topology, we see this node has physical CPUs 4-7. When creating the guest XML we can now specify this as the CPU mask for the guest. Where the guest virtual CPU count is specified

<vcpus>4</vcpus>

we can now add the mask

<vcpus cpuset='4-7'>4</vcpus>

As mentioned earlier, this works for QEMU, KVM and Xen guests. In the QEMU/KVM case, libvirt will use the sched_setaffinity call at guest startup, while in the Xen case libvirt will instruct XenD to make an equivalent hypercall.

Automatic placement using virt-install

This walkthrough illustrated the concepts in terms of virsh commands. If writing a management application using libvirt, you would of course use the equivalent APIs for looking up this data: virNodeGetInfo, virConnectGetCapabilities and virNodeGetCellsFreeMemory. The virt-install provisioning tool has done exactly this and provides a simple way to automatically apply a ‘best fit’ NUMA policy when installing guests. Quoting its manual page

   --cpuset=CPUSET

   Set which physical cpus the guest can use. "CPUSET" is a comma separated
   list of numbers, which can also be specified in ranges. Example:

     0,2,3,5     : Use processors 0,2,3 and 5
     1-3,5,6-8   : Use processors 1,2,3,5,6,7 and 8

   If the value ’auto’ is passed, virt-install attempts to automatically
   determine an optimal cpu pinning using NUMA data, if available.

So if you have a NUMA machine and use virt-install, simply always add --cpuset=auto whenever provisioning a new guest.
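
For management applications calling the libvirt APIs directly, the same kind of ‘best fit’ logic is easy to sketch. The following is only a simplified illustration of the idea, not virt-install’s actual implementation, and it assumes the guest needs 3 GB of RAM as in the earlier example:

import xml.etree.ElementTree as ET
import libvirt

GUEST_MEM_KB = 3 * 1024 * 1024     # the guest needs 3 GB of RAM

conn = libvirt.open("qemu:///system")
caps = ET.fromstring(conn.getCapabilities())
cells = caps.findall("host/topology/cells/cell")

# Pick the first NUMA cell with enough free memory and pin to its CPUs
# (assumes cells are listed in the capabilities XML in index order)
for idx, free in enumerate(conn.getCellsFreeMemory(0, len(cells))):
    if free // 1024 < GUEST_MEM_KB:
        continue
    cpus = [cpu.get("id") for cpu in cells[idx].findall("cpus/cpu")]
    print("Place guest on cell %d, cpuset=%s" % (idx, ",".join(cpus)))
    break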

Fine tuning CPU affinity at runtime

The scheme outlined above is focused on the initial guest placement at boot time. There may be times where it becomes necessary to fine-tune the CPU affinity at runtime. libvirt/virsh can cope with this need too, via the vcpuinfo and vcpupin commands. First, the virsh vcpuinfo command gives you the latest data about where each virtual CPU is running. In this example, rhel5xen is a guest on a Fedora KVM host which I used for RHEL5 Xen package maintenance work. It has 4 virtual CPUs and is being allowed to run on any host CPU

# virsh vcpuinfo rhel5xen
VCPU:           0
CPU:            3
State:          running
CPU time:       0.5s
CPU Affinity:   yyyyyyyy

VCPU:           1
CPU:            1
State:          running
CPU Affinity:   yyyyyyyy

VCPU:           2
CPU:            1
State:          running
CPU Affinity:   yyyyyyyy

VCPU:           3
CPU:            2
State:          running
CPU Affinity:   yyyyyyyy

Now let’s say that I want to lock each of these virtual CPUs to a separate host CPU in the 2nd NUMA node.

# virsh vcpupin rhel5xen 0 4

# virsh vcpupin rhel5xen 1 5

# virsh vcpupin rhel5xen 2 6

# virsh vcpupin rhel5xen 3 7

The vcpuinfo command can be used again to confirm the placement

# virsh vcpuinfo rhel5xen
VCPU:           0
CPU:            4
State:          running
CPU time:       32.2s
CPU Affinity:   ----y---

VCPU:           1
CPU:            5
State:          running
CPU time:       16.9s
CPU Affinity:   -----y--

VCPU:           2
CPU:            6
State:          running
CPU time:       11.9s
CPU Affinity:   ------y-

VCPU:           3
CPU:            7
State:          running
CPU time:       14.6s
CPU Affinity:   -------y

And just to prove I’m not faking it all, here’s the KVM process running on the host and its /proc status

# grep pid /var/run/libvirt/qemu/rhel5xen.xml
<domstatus state='running' pid='4907'>

# grep Cpus_allowed_list /proc/4907/task/*/status
/proc/4907/task/4916/status:Cpus_allowed_list: 4
/proc/4907/task/4917/status:Cpus_allowed_list: 5
/proc/4907/task/4918/status:Cpus_allowed_list: 6
/proc/4907/task/4919/status:Cpus_allowed_list: 7
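
The same fine tuning can be done from a management application via virDomainPinVcpu and virDomainGetVcpus. A rough sketch with the Python bindings, assuming the 8 CPU host used above:

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("rhel5xen")

# The cpumap is one boolean per host CPU; pin virtual CPUs 0-3 to host CPUs 4-7
ncpus = conn.getInfo()[2]
for vcpu in range(4):
    cpumap = tuple(i == vcpu + 4 for i in range(ncpus))
    dom.pinVcpu(vcpu, cpumap)

# virDomainGetVcpus: confirm where each virtual CPU is now running
for number, state, cputime, cpu in dom.vcpus()[0]:
    print("VCPU %d running on host CPU %d" % (number, cpu))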

Future work

The approach outlined above relies on the fact that the kernel will always try to allocate memory from the NUMA node that matches the one the guest CPUs are executing on. While this is sufficient in the simple case, there are some pitfalls along the way. Between the time the guest is started & memory is allocated, RAM from the NUMA node in question may have been used up, causing the OS to fall back to allocating from another node. For this reason, if placing guests on NUMA nodes, it is crucial that all guests running on the host have fixed placement, with none allowed to float free. In some weird and wonderful NUMA topologies (hello Itanium!) there can be NUMA nodes which have only CPUs, or only RAM. To cope with these it will be necessary to extend libvirt to allow an explicit memory allocation node to be listed in the guest configuration.