make-tiny-image.py: creating tiny initrds for testing QEMU or Linux kernel/userspace behaviour

Posted: March 9th, 2023 | Filed under: Coding Tips, Fedora, libvirt, Virt Tools | Tags: , , | No Comments »

As a virtualization developer a significant amount of time is spent in understanding and debugging the behaviour and interaction of QEMU and the guest kernel/userspace code. As such my development machines have a variety of guest OS installations that get booted for various tasks. Some tasks, however, require a repeated cycle of QEMU code changes, or QEMU config changes, followed by guest testing. Waiting for an OS to boot can quickly become a significant time sink affecting productivity and lead to frustration. What is needed is a very low overhead way to accomplish simple testing tasks without an OS getting in the way.

Enter ‘make-tiny-image.py‘ tool for creating minimal initrd images.

If invoked with no arguments, this tool will create an initrd containing nothing more than busybox. The “init” program will be a script that creates a few device nodes, mounts proc/sysfs and then runs the busybox ‘sh’ binary to provide an interactive shell. This is intended to be used as follows

$ ./make-tiny-image.py
tiny-initrd.img
6.0.8-300.fc37.x86_64

$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -accel kvm -m 1000 -display none -serial stdio
~ # uname  -a
Linux (none) 6.0.8-300.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 11 15:09:04 UTC 2022 x86_64 x86_64 x86_64 Linux
~ # uptime
 15:05:42 up 0 min,  load average: 0.00, 0.00, 0.00
~ # free
              total        used        free      shared  buff/cache   available
Mem:         961832       38056      911264        1388       12512      845600
Swap:             0           0           0
~ # df
Filesystem           1K-blocks      Used Available Use% Mounted on
none                    480916         0    480916   0% /dev
~ # ls
bin   dev   init  proc  root  sys   usr
~ # <Ctrl+D>
[   23.841282] reboot: Power down

When I say “low overhead”, just how low are we talking about ? With KVM, it takes less than a second to bring up the shell. Testing with emulation is where this really shines. Booting a full Fedora OS with QEMU emulation is slow enough that you don’t want to do it at all frequently. With this tiny initrd, it’ll take a little under 4 seconds to boot to the interactive shell. Much slower than KVM, but fast enough you’ll be fine repeating this all day long, largely unaffected by the (lack of) speed relative to KVM.

The make-tiny-image.py tool will create the initrd such that it drops you into a shell, but it can be told to run another command instead. This is how I tested the overheads mentioned above

$ ./make-tiny-image.py --run poweroff
tiny-initrd.img
6.0.8-300.fc37.x86_64

$ time qemu-system-x86_64 \
     -kernel /boot/vmlinuz-$(uname -r) \
     -initrd tiny-initrd.img \
     -append 'console=ttyS0 quiet' \
     -m 1000 -display none -serial stdio -accel kvm
[    0.561174] reboot: Power down

real	0m0.828s
user	0m0.613s
sys	0m0.093s
$ time qemu-system-x86_64 \
     -kernel /boot/vmlinuz-$(uname -r) \
     -initrd tiny-initrd.img \
     -append 'console=ttyS0 quiet' \
     -m 1000 -display none -serial stdio -accel tcg
[    2.741983] reboot: Power down

real	0m3.774s
user	0m3.626s
sys	0m0.174s

As a more useful real world example, I wanted to test the effect of changing the QEMU CPU configuration against KVM and QEMU, by comparing at the guest /proc/cpuinfo.

$ ./make-tiny-image.py --run 'cat /proc/cpuinfo'
tiny-initrd.img
6.0.8-300.fc37.x86_64

$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel tcg -cpu max | grep '^flags'
flags		: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
                  cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss syscall 
                  nx mmxext pdpe1gb rdtscp lm 3dnowext 3dnow rep_good nopl 
                  cpuid extd_apicid pni pclmulqdq monitor ssse3 cx16 sse4_1 
                  sse4_2 movbe popcnt aes xsave rdrand hypervisor lahf_lm 
                  svm cr8_legacy abm sse4a 3dnowprefetch vmmcall fsgsbase 
                  bmi1 smep bmi2 erms mpx adx smap clflushopt clwb xsaveopt 
                  xgetbv1 arat npt vgif umip pku ospke la57

$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel kvm -cpu max | grep '^flags'
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
                  cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx 
                  pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good 
                  nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx 
                  ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe 
                  popcnt tsc_deadline_timer aes xsave avx f16c rdrand 
                  hypervisor lahf_lm abm 3dnowprefetch cpuid_fault 
                  invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced 
                  tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase 
                  tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid mpx 
                  rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 
                  xsaves arat umip sgx_lc md_clear arch_capabilities

NB, with the list of flags above, I’ve manually line wrapped the output for saner presentation in this blog rather than have one giant long line.

These examples have relied on tools provided by busybox, but we’re not limited by that. It is possible to tell it to copy in arbitrary extra binaries from the host OS by just listing their name. If it is a dynamically linked ELF binary, it’ll follow the ELF header dependencies, pulling in any shared libraries needed.

$ ./make-tiny-image.py hwloc-info lstopo-no-graphics
tiny-initrd.img
6.0.8-300.fc37.x86_64
Copy bin /usr/bin/hwloc-info -> /tmp/make-tiny-imagexu_mqd99/bin/hwloc-info
Copy bin /usr/bin/lstopo-no-graphics -> /tmp/make-tiny-imagexu_mqd99/bin/lstopo-no-graphics
Copy lib /lib64/libhwloc.so.15 -> /tmp/make-tiny-imagexu_mqd99/lib64/libhwloc.so.15
Copy lib /lib64/libc.so.6 -> /tmp/make-tiny-imagexu_mqd99/lib64/libc.so.6
Copy lib /lib64/libm.so.6 -> /tmp/make-tiny-imagexu_mqd99/lib64/libm.so.6
Copy lib /lib64/ld-linux-x86-64.so.2 -> /tmp/make-tiny-imagexu_mqd99/lib64/ld-linux-x86-64.so.2
Copy lib /lib64/libtinfo.so.6 -> /tmp/make-tiny-imagexu_mqd99/lib64/libtinfo.so.6

$ qemu-system-x86_64 -kernel /boot/vmlinuz-$(uname -r) -initrd tiny-initrd.img -append 'console=ttyS0 quiet' -m 1000 -display none -serial stdio -accel kvm 
~ # hwloc-info 
depth 0:           1 Machine (type #0)
 depth 1:          1 Package (type #1)
  depth 2:         1 L3Cache (type #6)
   depth 3:        1 L2Cache (type #5)
    depth 4:       1 L1dCache (type #4)
     depth 5:      1 L1iCache (type #9)
      depth 6:     1 Core (type #2)
       depth 7:    1 PU (type #3)
Special depth -3:  1 NUMANode (type #13)
Special depth -4:  1 Bridge (type #14)
Special depth -5:  3 PCIDev (type #15)
Special depth -6:  1 OSDev (type #16)
Special depth -7:  1 Misc (type #17)

~ # lstopo-no-graphics 
Machine (939MB total)
  Package L#0
    NUMANode L#0 (P#0 939MB)
    L3 L#0 (16MB) + L2 L#0 (4096KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  HostBridge
    PCI 00:01.1 (IDE)
      Block "sr0"
    PCI 00:02.0 (VGA)
    PCI 00:03.0 (Ethernet)
  Misc(MemoryModule)

An obvious limitation is that if the binary/library requires certain data files, those will not be present in the initrd. It isn’t attempting to do anything clever like query the corresponding RPM file list and copy those. This tool is meant to be simple and fast and keep out of your way. If certain data files are critical for testing though, the --copy argument can be used. The copied files will be put at the same path inside the initrd as found on the host

$ ./make-tiny-image.py --copy /etc/redhat-release 
tiny-initrd.img
6.0.8-300.fc37.x86_64
Copy extra /etc/redhat-release -> /tmp/make-tiny-imageicj1tvq4/etc/redhat-release

$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel kvm 
~ # cat /etc/redhat-release 
Fedora release 37 (Thirty Seven)

What if the problem being tested requires using some kernel modules ? That’s covered too with the --kmod argument, which will copy in the modules listed, along with their dependencies and the insmod command itself. As an example of its utility, I used this recently to debug a regression in support for the iTCO watchdog in Linux kernels

$ ./make-tiny-image.py --kmod lpc_ich --kmod iTCO_wdt  --kmod i2c_i801 
tiny-initrd.img
6.0.8-300.fc37.x86_64
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/mfd/lpc_ich.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/lpc_ich.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/watchdog/iTCO_wdt.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/iTCO_wdt.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/watchdog/iTCO_vendor_support.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/iTCO_vendor_support.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/mfd/intel_pmc_bxt.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/intel_pmc_bxt.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/i2c/busses/i2c-i801.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/i2c-i801.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/i2c/i2c-smbus.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/i2c-smbus.ko.xz
Copy bin /usr/sbin/insmod -> /tmp/make-tiny-image63td8wbl/bin/insmod
Copy lib /lib64/libzstd.so.1 -> /tmp/make-tiny-image63td8wbl/lib64/libzstd.so.1
Copy lib /lib64/liblzma.so.5 -> /tmp/make-tiny-image63td8wbl/lib64/liblzma.so.5
Copy lib /lib64/libz.so.1 -> /tmp/make-tiny-image63td8wbl/lib64/libz.so.1
Copy lib /lib64/libcrypto.so.3 -> /tmp/make-tiny-image63td8wbl/lib64/libcrypto.so.3
Copy lib /lib64/libgcc_s.so.1 -> /tmp/make-tiny-image63td8wbl/lib64/libgcc_s.so.1
Copy lib /lib64/libc.so.6 -> /tmp/make-tiny-image63td8wbl/lib64/libc.so.6
Copy lib /lib64/ld-linux-x86-64.so.2 -> /tmp/make-tiny-image63td8wbl/lib64/ld-linux-x86-64.so.2

$ ~/src/virt/qemu/build/qemu-system-x86_64 -kernel /boot/vmlinuz-$(uname -r) -initrd tiny-initrd.img -append 'console=ttyS0 quiet' -m 1000 -display none -serial stdio -accel kvm  -M q35 -global ICH9-LPC.noreboot=false -watchdog-action poweroff -trace ich9* -trace tco*
ich9_cc_read addr=0x3410 val=0x20 len=4
ich9_cc_write addr=0x3410 val=0x0 len=4
ich9_cc_read addr=0x3410 val=0x0 len=4
ich9_cc_read addr=0x3410 val=0x0 len=4
ich9_cc_write addr=0x3410 val=0x20 len=4
ich9_cc_read addr=0x3410 val=0x20 len=4
tco_io_write addr=0x4 val=0x8
tco_io_write addr=0x6 val=0x2
tco_io_write addr=0x6 val=0x4
tco_io_read addr=0x8 val=0x0
tco_io_read addr=0x12 val=0x4
tco_io_write addr=0x12 val=0x32
tco_io_read addr=0x12 val=0x32
tco_io_write addr=0x0 val=0x1
tco_timer_reload ticks=50 (30000 ms)
~ # mknod /dev/watchdog0 c 10 130
~ # cat /dev/watchdog0
tco_io_write addr=0x0 val=0x1
tco_timer_reload ticks=50 (30000 ms)
cat: read error: Invalid argument
[   11.052062] watchdog: watchdog0: watchdog did not stop!
tco_io_write addr=0x0 val=0x1
tco_timer_reload ticks=50 (30000 ms)
~ # tco_timer_expired timeouts_no=0 no_reboot=0/1
tco_timer_reload ticks=50 (30000 ms)
tco_timer_expired timeouts_no=1 no_reboot=0/1
tco_timer_reload ticks=50 (30000 ms)
tco_timer_expired timeouts_no=0 no_reboot=0/1
tco_timer_reload ticks=50 (30000 ms)

The Linux regression had accidentally left the watchdog with the ‘no reboot’ bit set, so it would never trigger the action, which we diagnosed from seeing repeated QEMU trace events for tco_timer_expired after triggering the watchdog in the guest. This was quicky fixed by the Linux maintainers.

In spite of being such a simple and crude script, with many, many, many unhandled edge cases, it has proved remarkably useful at enabling low overhead debugging of QEMU/Linux guest behaviour.

Nova metadata recorded in libvirt guest instance XML

Posted: February 20th, 2015 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Virt Tools | Tags: , , , | No Comments »

One of the issues encountered when debugging libvirt guest problems with Nova, is that it isn’t always entirely obvious why the guest XML is configured the way it is. For a while now, libvirt has had the ability to record arbitrary application specific metadata in the guest XML. Each application simply declares the XML namespace it wishes to use and can then record whatever it wants. Libvirt will treat this metadata as a black box, never attempting to interpret or modify it. In the Juno release I worked on a blueprint to make use of this feature to record some interesting information about Nova.

The initial set of information recorded is as follows:

  • Version – the Nova version number, and any vendor specific package suffiix (eg RPM release number). This is useful as the user reporting a bug is often not entirely clear what particular RPM version was installed when the guest was first booted.
  • Name – the Nova instance display name. While you can correlated Nova instances to libvirt guests using the UUID, users reporting bugs often only tell you the display name. So recording this in the XML is handy to correlate which XML config corresponds to which Nova guest they’re talking about.
  • Creation time – the time at which Nova booted the guest. Sometimes useful when trying to understand the sequence in which things happened.
  • Flavour – the Nova flavour name, memory, disk, swap, ephemeral and vcpus settings. Flavours can be changed by the admin after a guest is booted, so having the original values recorded against the guest XML is again handy.
  • Owner – the tenant user ID and name, as well as their project
  • Root image – the glance image ID, if the guest was booted from an image

The Nova version number information in particular has already proved very useful in a couple of support tickets, showing that the VM instance was not booted under the software version that was initially claimed. There is still scope for augmenting this information further though. When working on another support issues it would have been handy to know the image properties and flavour extra specs that were set, as the user’s bug report also gave misleading / incorrect information in this area. Information about cinder block devices would also be useful to have access to, for cases where the guest isn’t booting from an image.

While all this info is technically available from the Nova database, it is far easier (and less dangerous) to ask the user to provide the libvirt XML configuration than to have them run random SQL queries. Standard OS trouble shooting tools such as  sosreport from RHEL/Fedora already collect the libvirt guest XML when run. As a result, the bug report is more likely to contain this useful data in the initial filing, avoiding the need to ask the user to collect further data after the fact.

To give an example of what the data looks like, a Nova guest booted with

$ nova boot --image cirros-0.3.0-x86_64-uec --flavor m1.tiny vm1

Gets the following data recorded

$ virsh -c qemu:///system dumpxml instance-00000001
<domain type='kvm' id='2'>
  <name>instance-00000001</name>
  <uuid>d0e51bbd-cbbd-4abc-8f8c-dee2f23ded12</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="2015.1"/>
      <nova:name>vm1</nova:name>
      <nova:creationTime>2015-02-19 18:23:44</nova:creationTime>
      <nova:flavor name="m1.tiny">
        <nova:memory>512</nova:memory>
        <nova:disk>1</nova:disk>
        <nova:swap>0</nova:swap>
        <nova:ephemeral>0</nova:ephemeral>
        <nova:vcpus>1</nova:vcpus>
      </nova:flavor>
      <nova:owner>
        <nova:user uuid="ef53a6031fc643f2af7add439ece7e9d">admin</nova:user>
        <nova:project uuid="60a60883d7de429aa45f8f9d689c1fd6">demo</nova:project>
      </nova:owner>
      <nova:root type="image" uuid="2344a0fc-a34b-4e2d-888e-01db795fc89a"/>
    </nova:instance>
  </metadata>
 ...snip...
</domain>

The intention is that as long as the XML namespace URI (http://openstack.org/xmlns/libvirt/nova/1.0) isn’t changed, the data reported here will not change in a backwards incompatible manner. IOW, we will add further elements or attributes to the Nova metadata, but not change or remove existing elements or attributes. So if OpenStack related troubleshooting / debugging tools want to extract this data from the libvirt XML they can be reasonably well assured of compatibility across future Nova releases.

In the Kilo development cycle there have been patches submitted to record similar kind of data for VMWare guests, though obviously it uses a config data format relevant to VMWare rather than XML. Sadly this useful debugging improvement for VMWare had its feature freeze exception request rejected, pushing it out to Liberty, which is rather a shame :-(

Watching the libvirt RPC protocol using SystemTAP

Posted: November 30th, 2011 | Filed under: Fedora, libvirt, Virt Tools | Tags: , , , , , , | 1 Comment »

A couple of releases back I completely re-structured all the RPC handling code inside libvirt to make sure it could be properly shared between the client and server, as well as decoupling the RPC handling code from the implementation of the RPC functions. As part of this work I introduced a fairly comprehensive set of DTrace static probe points into the libvirt RPC code. While one could write a WireShark plugin that is able to decode the libvirt RPC protocol (oh look Michal already has written one), that would not be able to examine encrypted libvirt connections – which is pretty much all of them. By using static probes in the libvirt RPC code we can see the RPC messages being sent and received before/after encryption has been applied.

The observant will notice that I said I inserted DTrace static probes, while this blog subject line says SystemTAP. Well the SystemTAP developers had the good sense to make their userspace probing infrastructure support the DTrace static probe marker syntax. So inserting DTrace static probes into userspace code, trivially enables support for both DTrace and SystemTAP. I previously added DTrace probe support to QEMU/KVM and was very happy when Bryan Cantrill told me (at the recent KVM forum) that the DTrace probe support I added to KVM only needed minor build system tweaks to work on Solaris, despite my only ever having tested with Linux + SystemTAP.

Along with adding the DTrace markers to the libvirt RPC code, I also created two SystemTAP tapset files to make it simpler to use the probes from SystemTAP scripts. The first, /usr/share/systemtap/tapset/libvirt_probe.stp, contains the actual probe points, grouped by functional area, while the second, /usr/share/systemtap/tapset/libvirt_functions.stp, contains a bunch of helper functions for converting enum values into human friendly strings. The idea is that instead of seeing “Procedure 53”, a sysadmin would much rather see “Procedure domain_dump_core”. I won’t go into detail about what is in those two files here, instead I’ll just illustrate their use

Tracing the RPC client

Lets says we first want to see what messages the client is sending and receiving. There are two interesting probes here, “libvirt.rpc.server_client_msg_rx” and “libvirt.rpc.server_client_msg_tx_queue“. The former is triggered when a complete RPC message has been read off the wire, while the latter is triggered when an RPC message is queued for transmission. Ideally we would also have another probe triggered when an RPC message has been completely transmitted – that’s a future todo item. Simple usage of these two probes would be

# cat > demo.stp <<EOF
probe libvirt.rpc.client_msg_rx {
  printf("client=%p len=%d program=%d version=%d procedure=%d type=%d status=%d serial=%d\n",
         client, len, prog, vers, proc, type, status, serial);
}
probe libvirt.rpc.client_msg_tx_queue {
  printf("client=%p len=%d program=%s version=%d procedure=%s type=%s status=%d serial=%d\n",
         client, len, prog, vers, proc, type, status, serial);
}
EOF
# stap demo.stp
client=0x7f827c3b1010 len=28 program=536903814 version=1 procedure=66 type=0 status=0 serial=0
client=0x7f827c3b1010 len=36 program=536903814 version=1 procedure=66 type=1 status=0 serial=0
client=0x7f827c3b1010 len=40 program=536903814 version=1 procedure=1 type=0 status=0 serial=1
client=0x7f827c3b1010 len=28 program=536903814 version=1 procedure=1 type=1 status=0 serial=1
client=0x7f827c3b1010 len=28 program=536903814 version=1 procedure=110 type=0 status=0 serial=2
client=0x7f827c3b1010 len=48 program=536903814 version=1 procedure=110 type=1 status=0 serial=2
client=0x7f827c3b1010 len=28 program=536903814 version=1 procedure=2 type=0 status=0 serial=3
client=0x7f827c3b1010 len=28 program=536903814 version=1 procedure=2 type=1 status=0 serial=3

The example shows the results of running “virsh domname vm1”. There are 4 RPC calls made here, 66 (authenticate), 1 (open), 110 (get uri), 2 (close).

Tracing the client with friendly output

Unless you have memorized libvirt RPC enums, this isn’t a very friendly way to trace the code. This is where the aforementioned libvirt_functions.stp tapset comes into play.

# cat > demo.stp <<EOF
probe libvirt.rpc.client_msg_rx {
  printf("R client=%p len=%d program=%s version=%d procedure=%s type=%s status=%s serial=%d\n",
         client, len,
         libvirt_rpc_program_name(prog, 0),
         vers,
         libvirt_rpc_procedure_name(prog, vers, proc, 0),
         libvirt_rpc_type_name(type, 0),
         libvirt_rpc_status_name(status, 0),
         serial);
}
probe libvirt.rpc.client_msg_tx_queue {
  printf("T client=%p len=%d program=%s version=%d procedure=%s type=%s status=%s serial=%d\n",
         client, len,
         libvirt_rpc_program_name(prog, 0),
         vers,
         libvirt_rpc_procedure_name(prog, vers, proc, 0),
         libvirt_rpc_type_name(type, 0),
         libvirt_rpc_status_name(status, 0),
         serial);
}
EOF
# stap demo.stp
T client=0x7f3e3dec0010 len=28 program=remote version=1 procedure=auth_list type=call status=ok serial=0
R client=0x7f3e3dec0010 len=36 program=remote version=1 procedure=auth_list type=reply status=ok serial=0
T client=0x7f3e3dec0010 len=40 program=remote version=1 procedure=open type=call status=ok serial=1
R client=0x7f3e3dec0010 len=28 program=remote version=1 procedure=open type=reply status=ok serial=1
T client=0x7f3e3dec0010 len=28 program=remote version=1 procedure=get_uri type=call status=ok serial=2
R client=0x7f3e3dec0010 len=48 program=remote version=1 procedure=get_uri type=reply status=ok serial=2
T client=0x7f3e3dec0010 len=28 program=remote version=1 procedure=close type=call status=ok serial=3
R client=0x7f3e3dec0010 len=28 program=remote version=1 procedure=close type=reply status=ok serial=3

Much more friendly !

Tracing the server at the same time

It might desirable to see when the server itself receives the message, independently of when the client transmitted it. There are an identical set of probes available in the server, just replace ‘client’ with ‘server_client’ in the above examples. Thus the demo script can trivially be extended to show server messages at the same time:

# cat >> demo.stp << EOF
probe libvirt.rpc.server_client_msg_rx {
  printf("R server=%p len=%d program=%s version=%d procedure=%s type=%s status=%s serial=%d\n",
         client, len,
         libvirt_rpc_program_name(prog, 0),
         vers,
         libvirt_rpc_procedure_name(prog, vers, proc, 0),
         libvirt_rpc_type_name(type, 0),
         libvirt_rpc_status_name(status, 0),
         serial);
}
probe libvirt.rpc.server_client_msg_tx_queue {
  printf("T server=%p len=%d program=%s version=%d procedure=%s type=%s status=%s serial=%d\n",
         client, len,
         libvirt_rpc_program_name(prog, 0),
         vers,
         libvirt_rpc_procedure_name(prog, vers, proc, 0),
         libvirt_rpc_type_name(type, 0),
         libvirt_rpc_status_name(status, 0),
         serial);
}
# stap demo.stp
T client=0x7ff3c4855010 len=28 program=remote version=1 procedure=auth_list type=call status=ok serial=0
R server=0x17a2070 len=28 program=remote version=1 procedure=auth_list type=call status=ok serial=0
T server=0x17a2070 len=36 program=remote version=1 procedure=auth_list type=reply status=ok serial=0
R client=0x7ff3c4855010 len=36 program=remote version=1 procedure=auth_list type=reply status=ok serial=0
T client=0x7ff3c4855010 len=40 program=remote version=1 procedure=open type=call status=ok serial=1
R server=0x17a2070 len=40 program=remote version=1 procedure=open type=call status=ok serial=1
T server=0x17a2070 len=28 program=remote version=1 procedure=open type=reply status=ok serial=1
R client=0x7ff3c4855010 len=28 program=remote version=1 procedure=open type=reply status=ok serial=1
T client=0x7ff3c4855010 len=28 program=remote version=1 procedure=get_uri type=call status=ok serial=2
R server=0x17a2070 len=28 program=remote version=1 procedure=get_uri type=call status=ok serial=2
T server=0x17a2070 len=48 program=remote version=1 procedure=get_uri type=reply status=ok serial=2
R client=0x7ff3c4855010 len=48 program=remote version=1 procedure=get_uri type=reply status=ok serial=2
T client=0x7ff3c4855010 len=28 program=remote version=1 procedure=close type=call status=ok serial=3
R server=0x17a2070 len=28 program=remote version=1 procedure=close type=call status=ok serial=3
T server=0x17a2070 len=28 program=remote version=1 procedure=close type=reply status=ok serial=3
R client=0x7ff3c4855010 len=28 program=remote version=1 procedure=close type=reply status=ok serial=3

If the server is running on a different host than the client, just copy the demo.stp script to the other host and run a second copy there.

Further extensions

There are many further improvements that can be made to this script

  • Display a timestamp on each message
  • Associate each server side message with an individual socket
  • Display payload length
  • Display a message when the script is actually ready to run

To simplify life, we are maintaining a nice feature demonstration of the RPC SystemTAP probes in the libvirt GIT repository in the
examples/systemtap/rpc-monitor.stp file.

Here is what it can print out

  0.000 begin
  2.632 C + 0x7f1ea57dc010   local=127.0.0.1;0 remote=127.0.0.1;0
  2.632 C > 0x7f1ea57dc010   msg=remote.1.auth_list(call, ok, 0) len=28
  2.632 + S 0x1c1f710        local=127.0.0.1;0 remote=127.0.0.1;0
  2.632 > S 0x1c1f710        msg=remote.1.auth_list(call, ok, 0) len=28
  2.633 < S 0x1c1f710        msg=remote.1.auth_list(reply, ok, 0) len=36
  2.633 C < 0x7f1ea57dc010   msg=remote.1.auth_list(reply, ok, 0) len=36   2.633 C > 0x7f1ea57dc010   msg=remote.1.open(call, ok, 1) len=40
  2.633 > S 0x1c1f710        msg=remote.1.open(call, ok, 1) len=40
  2.639 < S 0x1c1f710        msg=remote.1.open(reply, ok, 1) len=28
  2.639 C < 0x7f1ea57dc010   msg=remote.1.open(reply, ok, 1) len=28   2.639 C > 0x7f1ea57dc010   msg=remote.1.get_uri(call, ok, 2) len=28
  2.639 > S 0x1c1f710        msg=remote.1.get_uri(call, ok, 2) len=28
  2.639 < S 0x1c1f710        msg=remote.1.get_uri(reply, ok, 2) len=48
  2.640 C < 0x7f1ea57dc010   msg=remote.1.get_uri(reply, ok, 2) len=48   2.640 C > 0x7f1ea57dc010   msg=remote.1.domain_lookup_by_id(call, ok, 3) len=32
  2.640 > S 0x1c1f710        msg=remote.1.domain_lookup_by_id(call, ok, 3) len=32
  2.640 < S 0x1c1f710        msg=remote.1.domain_lookup_by_id(reply, error, 3) len=180
  2.641 C < 0x7f1ea57dc010   msg=remote.1.domain_lookup_by_id(reply, error, 3) len=180   2.641 C > 0x7f1ea57dc010   msg=remote.1.close(call, ok, 4) len=28
  2.641 > S 0x1c1f710        msg=remote.1.close(call, ok, 4) len=28
  2.641 < S 0x1c1f710        msg=remote.1.close(reply, ok, 4) len=28
  2.641 C < 0x7f1ea57dc010   msg=remote.1.close(reply, ok, 4) len=28
  2.641 C - 0x7f1ea57dc010   local= remote=
  2.641 - S 0x1c1f710        local=127.0.0.1;0 remote=127.0.0.1;0

Tracing other areas of libvirt code

The RPC code is not the only place with SystemTAP/DTrace probe markers in libvirt. We have also instrumented our main event loop and provide an examples/systemtap/events.stp demo that prints out info like this

  0.000 begin
  2.359 18185 + handle 1 4 1
  2.360 18185 + handle 2 6 1
  2.360 18185 * handle 2 0
  2.360 14370 > handle 3 1
  2.360 14370 + handle 33 16 1
  2.361 14370 ~ 7 -1
  2.361 14370 > handle 33 1
  2.361 14370 * handle 33 1
  2.361 14370 * handle 33 1
  2.361 14370 * handle 33 3
  2.361 14370 ~ 7 -1
  2.361 14370 > handle 1 1
  2.361 14370 ~ 7 -1
  2.361 14370 > handle 33 2
  2.361 14370 * handle 33 1
  2.361 14370 ~ 7 -1
  2.361 18185 * handle 2 1
  2.362 18185 * handle 2 0

And finally we have instrumented our code which talks to the QEMU monitor, again providing a demo examples/systemtap/qemu-monitor.stp which prints out info like this

  0.000 begin
  3.848 ! 0x7f2dc00017b0 {"timestamp": {"seconds": 1319466931, "microseconds": 187755}, "event": "SHUTDOWN"}
  5.773 > 0x7f2dc0007960 {"execute":"qmp_capabilities","id":"libvirt-1"}
  5.774 < 0x7f2dc0007960 {"return": {}, "id": "libvirt-1"}   5.774 > 0x7f2dc0007960 {"execute":"query-commands","id":"libvirt-2"}
  5.777 < 0x7f2dc0007960 {"return": [{"name": "quit"}, {"name": ....snip....   5.777 > 0x7f2dc0007960 {"execute":"query-chardev","id":"libvirt-3"}
  5.778 < 0x7f2dc0007960 {"return": [{"filename": ....snip....   5.779 > 0x7f2dc0007960 {"execute":"query-cpus","id":"libvirt-4"}
  5.780 < 0x7f2dc0007960 {"return": [{"current": true, "CPU": 0, "pc": 1048560, "halted": false, "thread_id": 13299}], "id": "libvirt-4"}   5.780 > 0x7f2dc0007960 {"execute":"set_password","arguments":{"protocol":"vnc","password":"123456","connected":"keep"},"id":"libvirt-5"}
  5.782 < 0x7f2dc0007960 {"return": {}, "id": "libvirt-5"}   5.782 > 0x7f2dc0007960 {"execute":"expire_password","arguments":{"protocol":"vnc","time":"never"},"id":"libvirt-6"}
  5.783 < 0x7f2dc0007960 {"return": {}, "id": "libvirt-6"}   5.783 > 0x7f2dc0007960 {"execute":"balloon","arguments":{"value":224395264},"id":"libvirt-7"}
  5.785 < 0x7f2dc0007960 {"return": {}, "id": "libvirt-7"}   5.785 > 0x7f2dc0007960 {"execute":"cont","id":"libvirt-8"}
  5.789 ! 0x7f2dc0007960 {"timestamp": {"seconds": 1319466933, "microseconds": 129980}, "event": "RESUME"}
  5.789 < 0x7f2dc0007960 {"return": {}, "id": "libvirt-8"}
  7.537 ! 0x7f2dc0007960 {"timestamp": {"seconds": 1319466934, "microseconds": 881214}, "event": "SHUTDOWN"}

Conclusion

The introduction of static probes into the libvirt code has been enormously helpful in understanding the operation of libvirt. While we have comprehensive debug logging present in libvirt is it hard to tailor the output to show the precise data desired. Traditional debuggers like GDB are not very practical when trying to understand the live operation of a heavily multi-threaded system crossing multiple processes, and while strace is useful in some scenarios it is too low level to be useful in most scenarios. SystemTAP userspace probing provides the kind of debugging experience / tool that really suits understanding the complex interactions in a system like libvirt. It is no co-incidence that the first set of probes we have written have focused on the libvirt event loop, RPC code and QEMU monitor – three of the areas in libvirt which are both very critical operationally, and exceptionally hard to debug with traditional approaches. We will certainly be expanding our use of static probe markers in systemtap in the future. My real immediate wishlist is for systemtap to get better at providing userspace stack traces, since it fails to provide a useful trace far too often, as compared to GDB.

Update: Mark Wielaard showed me what I had todo to get nice stack traces from SystemTAP. Apparently it is not getting enough memory space to deal with stack traces with its default settings. Telling it to use a little more memory makes it work nicely:

# cat > demo.stp <<EOF
probe libvirt.rpc.client_msg_rx {
  printf("client=%p len=%d program=%d version=%d procedure=%d type=%d status=%d serial=%d\n",
         client, len, prog, vers, proc, type, status, serial);
  print_ustack(ubacktrace())
}
# stap -DTASK_FINDER_VMA_ENTRY_ITEMS=7680 demo.stp
client=0x7f775cf62010 len=36 program=536903814 version=1 procedure=66 type=1 status=0 serial=0
 0x3c57f0b3dd : virNetClientIOHandleInput+0x87d/0x890 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57f0b9a0 : virNetClientIOEventLoop+0x5b0/0x630 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57f0cb23 : virNetClientSend+0x2b3/0x590 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57f0d47c : virNetClientProgramCall+0x26c/0x8a0 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57ef091e : callWithFD+0xce/0x120 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57ef099c : call+0x2c/0x40 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57efee80 : doRemoteOpen+0x890/0x20f0 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57f0341b : remoteOpen+0x9b/0x290 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57ec2133 : do_open+0x1f3/0x1100 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57ec4616 : virConnectOpenAuth+0x76/0xb0 [/usr/lib64/libvirt.so.0.9.7]
 0x40ceb1 [/usr/bin/virsh+0xceb1/0x40000]
client=0x7f775cf62010 len=28 program=536903814 version=1 procedure=1 type=1 status=0 serial=1
 0x3c57f0b3dd : virNetClientIOHandleInput+0x87d/0x890 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57f0b9a0 : virNetClientIOEventLoop+0x5b0/0x630 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57f0cb23 : virNetClientSend+0x2b3/0x590 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57f0d47c : virNetClientProgramCall+0x26c/0x8a0 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57ef091e : callWithFD+0xce/0x120 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57ef099c : call+0x2c/0x40 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57eff57a : doRemoteOpen+0xf8a/0x20f0 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57f0341b : remoteOpen+0x9b/0x290 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57ec2133 : do_open+0x1f3/0x1100 [/usr/lib64/libvirt.so.0.9.7]
 0x3c57ec4616 : virConnectOpenAuth+0x76/0xb0 [/usr/lib64/libvirt.so.0.9.7]
 0x40ceb1 [/usr/bin/virsh+0xceb1/0x40000]
....

This makes me very happy :-)