History of APIs for spawning processes in libvirt without involving the shell

Libvirt has an interesting history when it comes to the spawning of child processes, for a long time eschewing all use of the standard C functions system and popen, instead preferring to use fork+exec via some higher level wrappers of our own design. There were a number of reasons for this decision, some obvious, some not so obvious:

  1. Eliminate all use of shell. Command lines we pass to programs often contain user input. Correctly validating user input to block malicious shell meta characters is non-trivial to get right. Removing use of shell avoids this class of exploits entirely and was the top priority.
  2. Thread safe file descriptor handling. It is generally a bug to allow child processes to inherit file descriptors. In a masterstroke of mis-design UNIX specifies that file descriptors are inheritable across exec by default, requiring an fcntl() call to set the O_CLOEXEC flag to prevent inheritance. Unless using non-standard glibc extensions, setting O_CLOEXEC is a race condition you will often loose in threaded programs using system/popen. The only portable way to 100% guarantee no leakage of file descriptors to child processes is to do a “mass close” of all file descriptors between fork+exec.
  3. Safe signal handling. The system/popen commands do not reset signal handlers after fork(). Thus signal handlers registered by the program are at risk of executing in between fork/exec when spawning external commands. Depending on what the signal handler does this might be a significant problem.
  4. Provide a better API. The system/popen commands are simple to use in the very simple scenarios, but do nothing to help with the more complex scenarios. For example, building up the list of argv to be execute often requires a lot of string & array manipulation.

An aside on bash “shellshock”

In light of the shellshock bug in bash, we’re rather happy that libvirt has taken the approach of avoiding system/popen & shell in general. There are currently only two places where I recall libvirt relying on the shell.

First when using the “SSH” transport for remote API access mechanism, the libvirt client will login to a remote host to setup a tunnel to the libvirtd daemon, and this involves the shell. This can only be used to exploit shellshock if the admin had attempted to limit the user’s access to the remote host using SSH’s ForceCommand config. This fairly uncommon in the context of libvirt, since granting access to the libvirtd daemon implies privileges equivalent to root already, and thus there’s nothing to be gained by login restrictions in SSH.

Second, the libvirt-guests service, which is run by init to suspend guests on shutdown, is written in shell and thus has the potential to be susceptible to shellshock. The risk would come if an unprivileged user had influence over the name of a guest (eg a guest called “() { :;}; echo vulnerable”. Fortunately testing so far indicates that no exploit was possible in the context of libvirt-guests, even with maliciously crafted guest names, since most of the work is done based on UUIDs.

Until a few months ago the network filtering code in libvirt relied in automatically generated huge shell scripts to run iptables commands. We’ve not analysed this obsolete code to see if it was vulnerable to shellshock attacks, but it is believed to be unlikely since the only user controlled strings were iptables chain names and libvirt had strict validation on their content. In any case current libvirt versions have removed all this usage of shell, instead talking to firewalld via the DBus APIs.

Of course it is possible that other external programs that libvirt talks to, in turn use shell & could be vulnerable, but at least the areas that libvirt is responsible appear to be safe from shellshock attacks.

A timeline of process spawning API features

With system/popen ruled out, the alternatives for spawning external programs are fork+exec or possibly posix_spawn. The posix_spawn API is actually fairly decent in terms of the flexibility it offers, but still has a fairly high cost of usage in terms of populating the arguments it needs to receive. Thus, over the years libvirt has evolved a higher level API around the fork+exec system calls that is now in universal use across the codebase. The history below gives a little background on how the APIs evolved to their current form and featureset.

  • Feb 2007 – qemudStartVMDaemon(). When the QEMU driver was first added to libvirt the qemudStartVMDaemon() function was introduced to simplify spawning of the QEMU binary via fork+exec. The notable things this wrapper did were to connect the child processes stdio to pipes and then explicitly close every other file descriptor. This avoided the risky usage of shell and prevents leak of file descriptors. The impl was not reusable at this point since it also contained the code for turning the QEMU configuration into an array of argv
  • Feb 2007 – qemudExec(). When support for running dnsmasq was added to the QEMU driver, qemudStartVMDaemon() was refactored to create the qemudExec() funtion. This was the first re-usable wrapper around fork+exec in libvirt. Given an array of argv parameters it would spawn the child process connecting its stdin to /dev/null and returning a pair of pipes for reading stdout & stderr. This is on a par with popen() in terms of ease of use, but far safer to use due tot he avoidance of shell and safe file descriptor handling. It would gain many more features in the changes that follow
  • Jul, 2007 – _virExec(). With the introduction of the OpenVZ driver, the qemudExec() function was pulled out of the QEMU driver files into a shared utility module. This was the first tentative step towards sharing large amounts of code between different virtualization drivers in libvirt. The functionality remains the same as with qemudExec()
  • Aug, 2008 – _virExec(). The signal handling race mentioned earlier was discovered. A signal handler registered in the parent was set to write to a pipe when it triggered, but _virExec had closed the pipe file descriptor and the FD number happened to have been reused when setting up the pipe to use for the child programs stdio. The fix was to block all signals before running fork() and unblock them after fork(). Before unblocking them though, the child process reset all signal handlers to the defaults.
  • Jan, 2008 – virRun(). The _virExec() API was designed towards long lived child programs, so it would return the child PID and expect the caller to run waitpid later. To simplify spawning of short-lived programs the virRun() AP I was introduced that simply runs _virExec() and then does an immediate waitpid, feeding back the exit status to the caller. This is on a par with system() in terms of ease of use, but far safer to use due to avoidance of shell and safe signal & file descriptor handling.
  • Aug, 2008 – _virExec(). The _virExec() API initially created a pair of pipes to allow the childs stdout/stderr to be fed back to the parent. It was found to be convenient to pass a pre-opened file descriptor instead of creating a new pipe. Thus the _virExec() API was enhanced to allow such usage.
  • Aug, 2008 – _virExec(). Initially all programs spawned would inherit all environment variables set in the libvirtd daemon. The _virExec() API was enhanced to allow a custom set of environments to be passed in, to replace the existing environment. If a custom environment was requested, execve() would be used, otherwise it would continue to use execvp(). The same change also introduced a flag to request that the child program be daemonized. When this is requested, the child will fork() again and the original child would exit. The child process would also have its working directory changed to “/” and become a session leader.
  • Aug, 2008 – _virExec(). As mentioned previously, all file descriptors in the child would be closed and new set of FDs attached to stdin/out/err. To have greater control the _virExec() API was enhanced to allow the caller to specify a set of file descriptors to keep open.
  • Feb, 2009 – _virExec(). When introducing support for sVirt / SELinux to the QEMU driver there was a need to perform some special actions in between the fork() and exec() calls. Rather than hard code this setup for every caller of _virExec(), the ability to provide a pre-exec callback was introduced. This callback would be run immediately before exec() and would be used to set the SELinux process label for QEMU.
  • May, 2009 – _virExec(). Many child programs are able to write out a pidfile, which is particularly useful when daemonizing them, as there is no easy way to identify the second level child PID otherwise. This is a little racy for the parent process though, because there is no guarantee that the pidfile exists by the time _virExec() returns control to the parent. Thus _virExec() was enhanced to allow it to directly create pidfiles when daemonizing a command. The parent is thus guaranteed that the pidfile exists when _virExec() returns.
  • Jun, 2009 – _virExec(). When a privileged process is spawned it will generally inherit all capabilities of the parent process. If it is known that a program won’t require any capabilities even when running as root, then there can be a benefit in removing them. The virExec() API was thus extended to use libcap-ng to optionally clear capabilities of child procceses.
  • Feb, 2010 – _virExec(). Synchronize with internal logging mutex. If a thread was in the middle of a logging call while another thread used virExec() the logging mutex would still be held in the child process. Any attempt to log in the child process would thus deadlock. To deal with this, the logging mutex is aquired before forking the new process and released afterwards. This kind of dance must be done anywhere there is a global mutex that needs to be safely accessed in between a fork+exec pair.
  • Feb, 2010 – virFork(). There were a couple of cases where libvirt needs the ability to fork without exec’ing a new binary. The code for handling fork() and resetting of signal handlers was split out of virExec() to allow to be used independently.
  • May, 2010 – virCommand. The list of parameters to the virExec() API had grown larger than desired, so a new object, virCommand, is introduced. The idea is that an object is populated with all the information related to the command to be run and then an API is invoked to execute it. With virExec the caller was still responsible for allocating the char **argv, but with virCommand there are now helpers to greatly simplify the argv creation and make it less error prone.
  • Nov, 2010 – virCommand. Ordinarily, once a child process has forked it will run asynchronously from the parent process. There are some scenarios in which it is necessary to have a lock-step synchronization between the parent and child process. For example, libvirt’s disk locking needs to acquire leases on storage before the QEMU binary is exec’d but it needs to know the child PID too & the lock acquisition code cannot run in the child PID. The virCommand API is thus extended to introduce a handshake capability. Before exec’ing the new binary the child will send a notification to the parent process and then wait for a response before continuing.
  • Jan, 2012 – virCommand. The virCommand API will either allow the child to inherit all capabilities or block all capabilities. There are some cases where finer control is required, such as spawning LXC containers with limited privileges. The virCommand API was thus extended to allow specific capbilities to be whitelisted in the child.
  • Jan, 2013 – virCommand. It is not uncommon to want to change the UID/GID of a process that is spawned, particularly if it cannot be trusted to change on its own accord, or if it will be unable to change due to filtering of the capabilities to remove the CAP_SETUID bit. The virCommand API is extended to allow an alternate UID+GID to be specified for the command to launch. This change will be done inbetween the fork+exec at the same time as modifying the process capabilities.
  • May, 2013 – virCommand.When spawning QEMU it is necessary to adjust some of the process limits, for example, to raise the maximum file handle count so it is independent of any limit applied to libvirtd. Once again the virCommand APIs are extended so that a number of process limits can be changed in between the fork+exec pair.
  • Mar, 2014 – virCommand. When unit testing code it is desirable to avoid interacting with broader system state, so it is hard to test code which spawns external commands to do work. With the virCommand APIs though it was easy to introduce a dry run mode where the test suite can supply a callback to be invoked instead of the actual command. The callback can send back fake data for stdout/stderr as required to test the calling code.
  • Sep, 2014 – virCommand. Under UNIX, a child process will inherit its parent’s umask by default which is not always desirable. For example although libvirtd may have a umask for 0077, it is desirable for QEMU to get a umask for 0007, so that group shared resources can be set up. The virCommand APIs grew a new option for specifying a umask to set in between the fork+exec stage.

The timeline above shows that libvirt’s own APIs for spawning child processes have grown a huge number of features over the last 7 years of development. They very quickly surpassed system() and popen() in terms of safety from various serious problems while maintaining their ease of use. They have achieved this without having to expose the codebase as a whole to the low level complexities of fork()+exec(). The low level details are isolated in one central place, with the rest of the code using higher level APIs. The next blog post will illustrate just how libvirt’s virCommand APIs are used in practice.

Announce: libvirt-sandbox “Cholistan” 0.5.1 release – an application sandbox toolkit

Posted: November 19th, 2013 | Filed under: Fedora, libvirt, Security, Virt Tools | Tags: , , , , , | No Comments »

I pleased to announce the a new public release of libvirt-sandbox, version 0.5.1, is now available from:

http://sandbox.libvirt.org/download/

The packages are GPG signed with

  Key fingerprint: DAF3 A6FD B26B 6291 2D0E  8E3F BE86 EBB4 1510 4FDF (4096R)

The libvirt-sandbox package provides an API layer on top of libvirt-gobject which facilitates the cration of application sandboxes using virtualization technology. An application sandbox is a virtual machine or container that runs a single application binary, directly from the host OS filesystem. In other words there is no separate guest operating system install to build or manage.

At this point in time libvirt-sandbox can create sandboxes using either LXC or KVM, and should in theory be extendable to any libvirt driver.

This release focused on exclusively on bugfixing

Changed in this release:

  • Fix path to systemd binary (prefers dir /lib/systemd not /bin)
  • Remove obsolete commands from virt-sandbox-service man page
  • Fix delete of running service container
  • Allow use of custom root dirs with ‘virt-sandbox –root DIR’
  • Fix ‘upgrade’ command for virt-sandbox-service generic services
  • Fix logrotate script to use virsh for listing sandboxed services
  • Add ‘inherit’ option for virt-sandbox ‘-s’ security context option, to auto-copy calling process’ context
  • Remove non-existant ‘-S’ option froom virt-sandbox-service man page
  • Fix line break formatting of man page
  • Mention LIBVIRT_DEFAULT_URI in virt-sandbox-service man page
  • Check some return values in libvirt-sandbox-init-qemu
  • Remove unused variables
  • Fix crash with partially specified mount option string
  • Add man page docs for ‘ram’ mount type
  • Avoid close of un-opened file descriptor
  • Fix leak of file handles in init helpers
  • Log a message if sandbox cleanup fails
  • Cope with domain being missing when deleting container
  • Improve stack trace diagnostics in virt-sandbox-service
  • Fix virt-sandbox-service content copying code when faced with non-regular files.
  • Improve error reporting if kernel does not exist
  • Allow kernel version/path/kmod to be set with virt-sandbox
  • Don’t overmount ‘/root’ in QEMU sandboxes by default
  • Fix nosuid / nodev mount options for tmpfs
  • Force 9p2000.u protocol version to avoid QEMU bugs
  • Fix cleanup when failing to start interactive sandbox
  • Create copy of kernel from /boot to allow relabelling
  • Bulk re-indent of code
  • Avoid crash when gateway is missing in network options
  • Fix symlink target created in multi-user.target.wants
  • Add ‘-p PATH’ option for virt-sandbox-service clone/delete to match ‘create’ command option.
  • Only allow ‘lxc:///’ URIs with virt-sandbox-service until further notice
  • Rollback state if cloning a service sandbox fails
  • Add more kernel modules instead of assuming they are all builtins
  • Don’t complain if some kmods are missing, as they may be builtins
  • Allow –mount to be repeated with virt-sandbox-service

Thanks to everyone who contributed to this release

Fine grained access control in libvirt using polkit

Posted: August 12th, 2013 | Filed under: Fedora, libvirt, OpenStack, Security, Virt Tools | Tags: , , , | 3 Comments »

Historically access control to libvirt has been very coarse, with only three privilege levels “anonymous” (only authentication APIs are allowed), “read-only” (only querying information is allowed) and “read-write” (anything is allowed). Over the past few months I have been working on infrastructure inside libvirt to support fine grained access control policies. The initial code drop arrived in libvirt 1.1.0, and the wiring up of authorization checks in drivers was essentially completed in libvirt 1.1.1 (with the exception of a handful of APIs in the legacy Xen driver code). We did not wish to tie libvirt to any single access control system, so the framework inside libvirt is modular, to allow for multiple plugins to be developed. The only plugin provided at this time makes use of polkit for its access control checks. There was a second proof of concept plugin that used SELinux to provide MAC, but there are a number of design issues still to be resolved with that, so it is not merged at this time.

The basic framework integration

The libvirt library exposes a number of objects (virConnectPtr, virDomainPtr, virNetworkPtr, virNWFilterPtr, virNodeDevicePtr, virIntefacePtr, virSecretPtr, virStoragePoolPtr, virStorageVolPtr), with a wide variety of operations defined in the public API. Right away it was clear that we did not wish to describe access controls based on the names of the APIs themselves. For each object there are a great many APIs which all imply the same level of privilege, so it made sense to collapse those APIs onto single permission bits. At the same time, some individual APIs could have multiple levels of privilege depending on the flags set in parameters, so would expand to multiple permission bits. Thus the first task to was come up with a list of permission bits which were able to cover all APIs. This was encoded in the internal viraccessperm.h header file. With the permissions defined, the next big task was to define a mapping between permissions and APIs. This mapping was encoded as magic comments in the RPC protocol definition file. This in turn allows the code for performing access control checks to be automatically generated, thus minimizing scope for coding errors, such as forgetting to perform checks in a method, or performing the wrong checks.

The final coding step was for the automatically generated ACL check methods to be inserted into each of the libvirt driver APIs. Most of the ACL checks validate the input parameters to ensure the caller is authorized to operate on the object in question. In a number of methods, the ACL checks are used to restrict / filter the data returned. For example, if asking for a list of domains, the returned list must be filtered to only those the client is authorized to see. While the code for checking permissions was auto-generated, it is not practical to automatically insert the checks into each libvirt driver. It was, however, possible to write scripts to perform static analysis on the code to validate that each driver had the full set of access control checks present. Of course it helps to tell developers / administrators which permissions apply to each API, so the code which generates the API reference documentation was also enhanced so that the API reference lists the permissions required in each circumstance.

The polkit access control driver

Libvirt has long made use of polkit for authenticating connections over its UNIX domain sockets. It was thus natural to expand on this work to make use of polkit as a driver for the access control framework. Historically this would not have been practical, because the polkit access control rule format did not provide a way for the admin to configure access control checks on individual object instances – only object classes. In polkit 0.106, however, a new engine was added which allowed admins to use javascript to write access control policies. The libvirt polkit driver takes object class names and permission names to form polkit action names. For example, the “getattr” permission on the virDomainPtr class maps to the polkit org.libvirt.api.domain.getattr permission. When performing a access control check, libvirt then populates the polkit authorization “details” map with one or more attributes which uniquely identify the object instance. For example, the virDomainPtr object gets “connect_driver” (libvirt driver name), “domain_uuid” (globally unique UUID), and “domain_name” (host local unique name) details set. These details can be referenced in the javascript policy to scope rules to individual object instances.

Consider a local user berrange who has been granted permission to connect to libvirt in full read-write mode. The goal is to only allow them to use the QEMU driver and not the Xen or LXC drivers which are also available in libvirtd. To achieve this we need to write a rule which checks whether the connect_driver attribute is QEMU, and match on an action name of org.libvirt.api.connect.getattr. Using the javascript rules format, this ends up written as

polkit.addRule(function(action, subject) {
    if (action.id == "org.libvirt.api.connect.getattr" &&
        subject.user == "berrange") {
          if (action.lookup("connect_driver") == 'QEMU') {
            return polkit.Result.YES;
          } else {
            return polkit.Result.NO;
          }
    }
});

As another example, consider a local user berrange who has been granted permission to connect to libvirt in full read-write mode. The goal is to only allow them to see the domain called demo on the LXC driver. To achieve this we need to write a rule which checks whether the connect_driver attribute is LXC and the domain_name attribute is demo, and match on a action name of org.libvirt.api.domain.getattr. Using the javascript rules format, this ends up written as

polkit.addRule(function(action, subject) {
    if (action.id == "org.libvirt.api.domain.getattr" &&
        subject.user == "berrange") {
          if (action.lookup("connect_driver") == 'LXC' &&
              action.lookup("domain_name") == 'demo') {
            return polkit.Result.YES;
          } else {
            return polkit.Result.NO;
          }
    }
});

Futher work

While the access control support in libvirt 1.1.1 provides a useful level of functionality, there is still more that can be done in the future. First of all, the polkit driver needs to have some performance optimization work done. It currently relies on invoking the ‘pkcheck’ binary to validate permissions. While this is fine for hosts with small numbers of objects, it will quickly become too costly. The solution here is to directly use the DBus API from inside libvirt.

The latest polkit framework is fairly flexible in terms of letting us identify object instances via the details map it associates with every access control check. It is far less flexible in terms of identifying the client user. It is fairly locked into the idea of identifying users via remote PID or DBus service name, and then exposing the username/groupnames to the javascript rules files. While this works fine for local libvirt connections over UNIX sockets, it is pretty much useless for connections arriving on libvirt’s TCP sockets. In the latter case the libvirt user is identied by a SASL username (typically a Kerberos principal name), or via an x509 certificate distinguished name (when using client certs with TLS). There’s no way official way to feed the SASL username of x509 dname down to the polkit javascript authorization rules files. Requests upstream to allow extra identifying attributes to be provide for the authorization subject have not been productive, so I’m considering (ab-)using the “details” map to provide identifying info for the users, alongside the identifying info for the object.

As mentioned earlier, there was a proof of concept SELinux driver written, that is yet to be finished. The work there is around figuring out / defining what the SELinux context is for each object to be checked and doing some work on SELinux policy. I think of this work as providing a capability similar to that done in PostgreSQL to enable SELinux MAC checks. It would be very nice to have a system which provides end-to-end MAC. I refer to this as sVirt 2.0 – the first (current) version of sVirt protected the host from guests – the second (future) version would also protect the host from management clients.

The legacy XenD based Xen driver has a couple of methods which lack access control, due to the inability to get access to the identifying attributes for the objects being operated upon. While we encourage people to use the new libxl based Xen driver, it is desirable to have the legacy Xen driver fully controlled for those people using legacy virtualization hosts. Some code refactoring will be required to fix the legacy Xen driver, likely at the cost of making some methods less efficient.

If there is user demand, work may be done to write an access control driver which is natively implemented entirely within libvirt. While the polkit javascript engine is fairly flexible, I’m not much of a fan of having administrators write code to define their access control policy. It would be preferable to have a way to describe the policy that was entirely declarative. With a libvirt native access control driver, it would be possible to create a simple declarative policy file format tailored to our precise needs. This would let us solve the problem of providing identifying info about the subject being checked. It would also have the potential to be more scalable by avoiding the need to interact with any remote authorization deamons over DBus. The latter could be a big deal when an individual API call needs to check 1000’s of permissions at once. The flipside of course, is that a libvirt specific access control driver is not good for interoperability across the broader system – the standardized use of polkit is good in that respect. There’s no technical reason why we can’t support multiple access control drivers to give the administrator choice / flexibility.

Finally, this work is all scheduled to arrive in Fedora 20, so anyone interested in testing it should look at current rawhide, or keep an eye out for the Fedora 20 virtualization test day.

EDITED: Aug 15th: Change example use of ‘action._detail_connect_driver’ to ‘action.lookup(“connect_driver”)’

A reminder why you should never mount guest disk images on the host OS

Posted: February 20th, 2013 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Security, Virt Tools | Tags: , , , , , , | 1 Comment »

The OpenStack Nova project has the ability to inject files into a guest filesystem immediately prior to booting the virtual machine instance. Historically the way it did this was to setup either a loop or NBD device on the host OS and then mount the guest filesystem directly on the host OS. One of the high priority tasks for Red Hat engineers when we became involved in OpenStack was to integrate libguestfs FUSE into Nova to replace the use of loop back + NBD devices, and then subsequently refactor Nova to introduce a VFS layer which enables use of the native libguestfs API to avoid any interaction with the host filesystem at all.

There has already been a vulnerability in the Nova code which allowed a malicious user to inject files to arbitrary locations in the host filesystem. This has of course been fixed, but even so mounting guest disk images on the host OS should still be considered very bad practice. The libguestfs manual describes the remaining risk quite well:

When you mount a filesystem, mistakes in the kernel filesystem (VFS) can be escalated into exploits by attackers creating a malicious filesystem. These exploits are very severe for two reasons. Firstly there are very many filesystem drivers in the kernel, and many of them are infrequently used and not much developer attention has been paid to the code. Linux userspace helps potential crackers by detecting the filesystem type and automatically choosing the right VFS driver, even if that filesystem type is unexpected. Secondly, a kernel-level exploit is like a local root exploit (worse in some ways), giving immediate and total access to the system right down to the hardware level

Libguestfs provides protection against this risk by creating a virtual machine inside which all guest filesystem manipulations are performed. Thus even if the guest kernel gets compromised by a VFS flaw, the attacker then still has to break out of the KVM virtual machine and its sVirt confinement to stand a chance of compromising the host OS. Some people have doubted the severity of this kernel VFS driver risk in the past, but an article posted on LWN today should serve reinforce the fact that libguestfs is right to be paranoid. The article highlights two kernel filesystem vulnerabilities (one in ext4 which is enabled in pretty much all Linux hosts) which left hosts vulnerable for as long as 3 years in some cases:

  • CVE-2009-4307: a divide-by-zero crash in the ext4 filesystem code. Causing this oops requires convincing the user to mount a specially-crafted ext4 filesystem image
  • CVE-2009-4020: a buffer overflow in the HFS+ filesystem exploitable, once again, by convincing a user to mount a specially-crafted filesystem image on the target system.

If the user has access to an OpenStack deployment which is not using libguestfs for file injection, then “convincing a user to mount a specially crafted filesystem” merely requires them to upload their evil filesystem image to glance and then request Nova to boot it.

Anyone deploying OpenStack with file injection enabled, is strongly advised to make sure libguestfs is installed to avoid any direct exposure of the host OS kernel to untrusted guest images.

While I picked on OpenStack as a convenient way to illustrate the problem here, it is not unique to OpenStack. Far too frequently I find documentation relating to virtualization that suggests people mount untrusted disk images directly on their OS. Based on their documented features I’m confident that a number of public virtual machine hosting companies will be mounting untrusted user disk images on their virtualization hosts, likely without using libguestfs for protection.

An idea for a race-free way to kill processes with safety checks

Posted: November 29th, 2012 | Filed under: Coding Tips, Fedora, libvirt, Security | Tags: , , , , , , | 2 Comments »

When a program has to kill a process, it may well want to perform some safety checks before invoking kill(), to ensure that the process really is the one that it is expecting. For example, it might check that the /proc/$PID/exe symlink matches the executable it originally used to start it. Or it might want to check the uid/gid the process is running as from /proc/$PID/status. Or any number of other attributes about a process that are exposed in /proc/$PID files.

This is really a similar pattern to the checks that a process might do when operating on a file. For example, it may stat() the file to check it has the required uid/gid ownership. It has long been known that doing stat() followed by open() is subject to a race condition which can have serious security implications in many circumstances. For this reason, well written apps will actually do open() and then fstat() on the file descriptor, so they’re guaranteed the checks are performed against the file they’re actually operating on.

If the kernel ever wraps around on PID numbers, it should be obvious that apps that do checks on /proc/$PID before killing a process are also subject to a race condition with potential security implications. This leads to the questions of whether it is possible to design a race free way of checking & killing a process, similar to the open() + fstat() pattern used for files. AFAICT there is not, but there are a couple of ways that one could be constructed without too much extra support from the kernel.

To attack this, we first need an identifier for the process which is unique and not reusable if a process dies & new one takes it place. As mentioned above, if the kernel allows PID numbers to wrap, the PID is unsuitable for this purpose. A file handle for the /proc/$PID directory though, it potentially suitable for this purpose. We still can’t simply open files like /proc/$PID/exe directly, since that’s relying on the potentially reusable PID number. Here the openat()/readlinkat() system calls come to rescuse. They allow you to pass a file descriptor for a directory and a relative filename. This guarantees that the file opened is in the directory expected, even if the original directory path changes / is replaced. All that is now missing is a way to kill a process, given a FD for /proc/$PID. There are two ideas for this, first a new system call fkill($fd, $signum), or a new proc file /proc/$PID/kill where you can write signal numbers.

With these pieces, a race free way to validate a process’ executable before killing it would look like

int pidfd = open("/proc/1234", O_RDONLY);
char buf[PATH_MAX];
char *exe = readlinkat(pidfd, "exe", buf, sizeof(buf));
if (strcmp(exe, "/usr/bin/mybinary") != 0)
return -1;
fkill(pidfd, SIGTERM);

Alternatively the last line could look like

char signum[10];
snprintf(signum, sizeof(signum), "%d", SIGTERM);
int sigfd = openat(pidfd, "kill", O_WRONLY);
write(sigfd, signum, strlen(signum))
close(sigfd)

Though I prefer the fkill() variant for reasons of brevity.

After reading this, you might wonder how something like systemd safely kills processes it manages without this kind of functionality. Well, it uses cgroups to confine processes associated with each service. Therefore it does not need to care about safety checking arbitrary PIDs, instead it can merely iterate over every PID in the service’s cgroup and kill them. The downside of cgroups is that it is Linux specific, but that doesn’t matter for systemd’s purposes. I believe that a fkill() syscall, however, is something that could be more easily used in a variety of existing apps without introducing too much portability code. Apps could simply use fkill() if available, but fallback to regular kill() elsewhere.

NB I’ve no intention of actually trying to implement this, since I like to stay out of kernel code. It is just an idea I wanted to air in case it interests anyone else.