Usage of the libvirt virCommand APIs for process spawning

The previous blog post looked at the history of libvirt APIs for spawning processes, up to the current day where there is a single virCommand object + APIs for spawning processes in a very flexible manner. This blog post will now look at the key features of this API and how it is used in practice.

Example usage

Before going into the dry details, lets consider a couple of real world examples where libvirt uses these APIs.

“system” replacement

As a first example, the “virNodeSuspendSetNodeWakeup” method is a place where libvirt might have traditionally used the ‘system’ call.

The goal is to suspend the host, setting a pre-defined wakeup alarm, for which libvirt needs to run the ‘rtcwake’ command. The wakeup time is provided to the API in terms of a unsigned long long which needs to be converted to a string.

If attempting this with the ‘system’ call the code might look like this:

static int virNodeSuspendSetNodeWakeup(unsigned long long alarmTime)
{
    char *setAlarmCmd;
    int ret = -1;

    if (asprintf(&setAlarmCmd, "rtcwake -m no -s %lld", alarmTime) < 0)
        goto cleanup;

    if (system(setAlarmCmd) < 0)
        goto cleanup;

    ret = 0;

 cleanup:
    free(setAlarmCmd);
    return ret;
}

Now consider what this would look like when using the virCommand APIs:

static int virNodeSuspendSetNodeWakeup(unsigned long long alarmTime)
{
    virCommandPtr setAlarmCmd;
    int ret = -1;

    setAlarmCmd = virCommandNewArgList("rtcwake", "-m", "no", "-s", NULL);
    virCommandAddArgFormat(setAlarmCmd, "%lld", alarmTime);

    if (virCommandRun(setAlarmCmd, NULL) < 0)
        goto cleanup;

    ret = 0;

 cleanup:
    virCommandFree(setAlarmCmd);
    return ret;
}

The difference in code complexity is negligible, but the difference in the quality of the implementation is significant.

“popen” replacement

As a second example, the “virStorageBackendIQNFound” method is a place where libvirt might have traditionally used the ‘popen’ call.

The goal this time is to run the iscsiadm command with a number of arguments and parse its stdout to look for a particular iSCSI target.

First consider what this might look like when using ‘popen’

static int virStorageBackendIQNFound(const char *initiatoriqn,
                                          char **ifacename)
{
    int ret = -1;
    FILE *fp = NULL;
    int fd = -1;
    char line[4096]

    if ((fp = popen("iscsiadm --mode iface")) == NULL)
        goto cleanup;

    while (fgets(line, 4096, fp) != NULL) {
       ...analyse line for a match... 
    }

    ret = 0;

cleanup:
    pclose(fp);
    close(fd);
    virCommandFree(cmd);

    return ret;
}

Now consider the re-write to use virCommand APIs

static int virStorageBackendIQNFound(const char *initiatoriqn,
                                          char **ifacename)
{
    int ret = -1;
    FILE *fp = NULL;
    int fd = -1;
    char line[4096]
    virCommandPtr cmd = virCommandNewArgList("iscsiadm", "--mode", "iface", NULL);

    virCommandSetOutputFD(cmd, &fd);
    if (virCommandRunAsync(cmd, NULL) < 0)
        goto cleanup;

    if ((fp = fdopen(fd, "r")) == NULL)
        goto cleanup;

    while (fgets(line, 4096, fp) != NULL) {
       ...analyse line for a match... 
    }
 
    if (virCommandWait(cmd, NULL) < 0)
        goto cleanup;
 
    ret = 0;

cleanup:
    fclose(fp);
    close(fd);
    virCommandFree(cmd);

    return ret;
}

There is a little more work todo for virCommand in terms of initial setup in this example. Technically the call to ‘virCommandWait’ could have been omitted here, since we don’t care about the exit status, but it is good practice to included it. If we had extra dynamic arguments to be provided to ‘iscsiadm’ that needed string formatting the two examples would have been nearer parity in terms of complexity. Even with the slightly longer code for virCommand, the result is a clear win from the quality POV avoiding the many flaws of popen’s implementation.

Detailed API examination

Now that the two examples have given a taste of what the virCommand APIs can do to replace popen/system, lets consider the full set of features exposed. After this it should be clear that the flexibility of the virCommand means there is never any need to delve into fork+exec anymore, let alone popen/system.

Constructing the command arguments

Probably the first task when spawning a command is actually construct the array of arguments and environment variables. In simple cases this can be done immediately when allocating the new virCommad object instance, for example using var-args

virCommandPtr cmd = virCommandNewArgList("touch", "/tmp/foo", NULL);

there is no need to check the return value of virCommandNew* for NULL. The later virCommandRun() API will look for a NULL pointer and report the OOM error at that point instead. The same is true for all error reporting in these APIs – virCommandRun is generally the only place where an error check is needed. This simplification in error handling is a major contributor to making this APIs hard to mis-use in calling code and thus minimizing errors.

Sometimes the list of arguments is not so simple that it can be initialized in one go via var-args. To deal with this it is possible to add arguments to an existing constructed command in a variety of ways

virCommandAddArgFormat(cmd, "--size=%d", 1025);
virCommandAddArgPair(cmd, "--user", "fred");
virCommandAddArgList(cmd, "some", "extra", "args", NULL);

This is handy because for complex command lines (eg those used with QEMU) it allows construction of the virCommand to be split up across multiple functions, each adding their own piece of the command line.

Setting up the environment

By default a process spawned will inherit the full environment of the parent process (almost always libvirtd in libvirt code). With things like QEMU though libvirt wants to be in complete control of the environment it runs under, so it will filter the environment to a subset of names. There are a couple of env variables that are always desired to pass down LD_PRELOAD, LD_LIBRARY_PATH, PATH, USER, HOME, LOGNAME & TMPDIR. If this set is desired there is a convenient method to request passthrough of this set

virCommandAddEnvPassCommand(cmd);

Additional environment variables can be set for passthrough from libvirtd. When passing through environment variables libvirt requires an explicit decision on whether the env variable is safe to pass when running setuid. If an env variable is considered unsafe for a setuid application, there is the option of passing a default value to substitute. The “PATH” variable is unsafe to pass when setuid, and should be set to a known safe value when running setuid:

virCommandAddEnvPassBlockSUID(cmd, "PATH", "/bin:/usr/bin");

The “LD_LIBRARY_PATH” variable is also unsafe when running setuid and should simply be dropped from the environment entirely:

virCommandAddEnvPassBlockSUID(cmd, "LD_LIBRARY_PATH", NULL);

Finally the “LOGNAME” is fine to allow even when setuid so can be left unchanged

virCommandAddEnvPassAllowSUID(cmd, "LOGNAME");

It is not always sufficient to just passthrough existing environment variables, so there are of course APIs to set them directly

virCommandAddEnvPair(cmd, "LOGNAME", "fred");
virCommandAddEnvFormat(cmd, "LOGNAME=%s", fred);
virCommandAddEnvString(cmd, "LOGNAME=fred");

Setting security attributes

Under UNIX a program will inherit process limits, umask and working directory from the parent. It is thus desirable to be able to override this, for example:

virCommandSetMaxFiles(cmd, 65536);

Along a similar vein the child’s umask can also be set

virCommandSetUmask(cmd, 0007);

If the current process’ working directory is unknown, it is a good idea to force an explicit working directory:

virCommandSetWorkingDirectory(cmd, "/");

It may sometimes be desirable to control what capabilities bits a child process has, to override the default behaviour. In such cases the command would to be initialized with the empty set and then the desired bits whitelisted.

virCommandClearCaps(cmd);
virCommandAllowCap(cmd, CAP_NET_RAW);

Finally when interacting with mandatory access control systems like SELinux or AppArmour it is possible to configure an explicit label for the child

virCommandSetSELinuxLabel(cmd, "system_u:system_r:svirt_t:s0:c135,c275");
virCommandSetAppArmorProfile(cmd, "2cb0e828-e6f6-40d1-b0f5-c50cdf34f5c9");

Interacting with stdio

A common requirement when spawning processes is to be able to interact with the child’s stdio in some manner. By default with the virCommand APIs, a process will get its stdin, stdout & stderr connected to /dev/null. For stdin there is a choice of feeding it a fixed length string, or connecting it up to the read end of an existing file descriptor, typically a pipe

virCommandSetInputBuffer(cmd, "Feed me brains");
virCommandSetInputFD(cmd, pipefd);

There are a similar pair of choices for receiving the stdout/stderr from the child. It is possible to supply a pointer to a ‘char *’ which will be filled with the child’s output upon exit. Alternatively a pointer to an ‘int’ can be provided, which can either specify an existing file descriptor, or if ‘-1’ a new anonymous pipe will be created.

char *child_out, *child_err;
virCommandSetOutputBuffer(cmd, &child_out);
virCommandSetErrorBuffer(cmd, &child_err);

int child_out -1, child_err = -1;
virCommandSetOutputFD(cmd, &child_out);
virCommandSetOutputFD(cmd, &child_err);

Passing file descriptors

Aside from any associated with stdio, all file descriptors will be closed when the child process is launched. This is generally a good thing since it prevents any leakage of file descriptors to the child. Such leakage can be a security flaw, and unless using glibc extensions to POSIX, it is not possible to avoid the race condition with setting O_CLOEXEC, so an explicit mass close is a very desirable approach. There can be times when it is is necessary to pass other arbitrary file descriptors to a child process. When passing a file descriptor it may also be desirable to close it in the parent process.

virCommandPassFD(cmd, 8, 0);
virCommandPassFD(cmd, 8, VIR_COMMAND_PASS_FD_CLOSE_PARENT);

Pre-exec callback hooks

Despite its wide range of features, there are times when the virCommand API is not sufficient for the job. In these cases there is the ability to request that a callback function be invoked immediately before exec’ing the child binary. This allows the caller to do arbitrary extra work, though of course bearing in mind POSIX’s rules about which functions are safe to use between fork+exec in a threaded application.

int my_hook(void *data)
{
    if (sometask(data) < 0)
        return -1;
    return 0;
}

virCommandSetPreExecHook(cmd, my_hook, "thefilename");

Executing the command

Everything until this point has been about setting up the command args and the constraints under which it will execute. None of the APIs shown so far require any kind of error checking of return values. Only now that it is time to execute the command will errors be reported and checked by the caller. The simplest way to execute is to use ‘virCommandRun’ which will block until the command finishes running and report the exit status

int status;
if (virCommandRun(cmd, &status) < 0) {
    virCommandFree(cmd);
    return -1;
}

virCommandFree(cmd);

It is possible to leave the ‘status’ arg as NULL in which case the API will turn any non-zero exit status into a fatal error. If the intention is to interact with the command via one or more file descriptors connected to stdio, then a slightly more flexible ‘virCommandRunAsync’ API is required. This call will only block until the process is actually running. The parent can then interact with it and when ready call ‘virCommandWait’.

int status;
pid_t child;
if (virCommandRunAsync(cmd, &child) < 0) {
    virCommandFree(cmd);
    return -1;
}

...interact with child via stdio or other means...
if (virCommandWait(cmd, &status) < 0) {
    virCommandFree(cmd);
    return -1;
}

virCommandFree(cmd);

If something goes wrong during interaction, it is possible to terminate the process with prejudice by using ‘virCommandAbort’ instead of ‘virCommandWait’.

Synchronizing with the child

Normally, once a child has fork’d off, the child & parent will continue execution in parallel, with the parent having no idea at which point the final exec() will have been performed. There can be cases where the parent needs to do some work in between the fork and exec. A pre-exec hook can often be used for this, but the work needs to take place in the parent process another solution is required. To deal with this it is possible to request a handshake take place with the child process. Before the child exec’s its binary, it will notify the parent process that is wants to handshake and wait for a reply. The parent process meanwhile will wait to be notify, then do its work and finally reply to the child again

virCommandRequireHandshake(cmd);

if (virCommandRunAsync(cmd) < 0)
    return -1;

virCommandHandshakeWait(cmd);
...do setup work...
virCommandHandshakeNotify(cmd);

Simplified command execution

The examples above are very powerful, but in the simplest use cases it is possible to combine the virCommandNew + virCommandRun + virCommandFree calls into a single API call

const char *args[] = { "/bin/program", "arg1", "arg2", NULL };
if (virRun(args, NULL) < 0)
  return -1;

This is pretty much equivalent to ‘system’ in terms of complexity, but much safer as it avoids the shell and many other problems mentioned previously.

Integration with unit tests

One of the particularly interesting features of the virCommand APIs is the ability to do unit testing of code that otherwise spawns external commands. The test suite can define a callback that will be invoked any time an attempt is made to run a command. This callback can analyse stdin string, fill in stdout/stderr strings and set the exit status. This is sufficient to avoid the need to run the real command in the context of unit tests in most cases.

static void testCommandCallback(const char *const*args,
                                const char *const*env,
                                const char *input,
                                char **output,
                                char **error,
                                int *status,
                                void *opaque)
{
    ....fake the exit status and fill in **output or **error...
}

virCommandSetDryRun(NULL, testCommandCallback, NULL);

The End

That completes our whirlwind tour of libvirts APIs for spawning child processes. It should be clear that a lot of thought & effort has gone into designing a set of APIs that maximize safety without compromising on ease of use. There can really be no excuse for using either the popen or system calls for spawning programs & thus leaving yourself vulnerable to flaws like shellshock. The libvirt code described in this post is all available under the terms of the LGPLv2+ should anyone wish to pull out & adapt the virCommand APIs for their own programs. I look forward to the day when it is possible to use a Linux system with no reliance on shell by any program. Shell should be exclusively for use by interactive login sessions and administrator local scripting work, not a part of applications where it only ever leads to misery & insecurity.

History of APIs for spawning processes in libvirt without involving the shell

Libvirt has an interesting history when it comes to the spawning of child processes, for a long time eschewing all use of the standard C functions system and popen, instead preferring to use fork+exec via some higher level wrappers of our own design. There were a number of reasons for this decision, some obvious, some not so obvious:

  1. Eliminate all use of shell. Command lines we pass to programs often contain user input. Correctly validating user input to block malicious shell meta characters is non-trivial to get right. Removing use of shell avoids this class of exploits entirely and was the top priority.
  2. Thread safe file descriptor handling. It is generally a bug to allow child processes to inherit file descriptors. In a masterstroke of mis-design UNIX specifies that file descriptors are inheritable across exec by default, requiring an fcntl() call to set the O_CLOEXEC flag to prevent inheritance. Unless using non-standard glibc extensions, setting O_CLOEXEC is a race condition you will often loose in threaded programs using system/popen. The only portable way to 100% guarantee no leakage of file descriptors to child processes is to do a “mass close” of all file descriptors between fork+exec.
  3. Safe signal handling. The system/popen commands do not reset signal handlers after fork(). Thus signal handlers registered by the program are at risk of executing in between fork/exec when spawning external commands. Depending on what the signal handler does this might be a significant problem.
  4. Provide a better API. The system/popen commands are simple to use in the very simple scenarios, but do nothing to help with the more complex scenarios. For example, building up the list of argv to be execute often requires a lot of string & array manipulation.

An aside on bash “shellshock”

In light of the shellshock bug in bash, we’re rather happy that libvirt has taken the approach of avoiding system/popen & shell in general. There are currently only two places where I recall libvirt relying on the shell.

First when using the “SSH” transport for remote API access mechanism, the libvirt client will login to a remote host to setup a tunnel to the libvirtd daemon, and this involves the shell. This can only be used to exploit shellshock if the admin had attempted to limit the user’s access to the remote host using SSH’s ForceCommand config. This fairly uncommon in the context of libvirt, since granting access to the libvirtd daemon implies privileges equivalent to root already, and thus there’s nothing to be gained by login restrictions in SSH.

Second, the libvirt-guests service, which is run by init to suspend guests on shutdown, is written in shell and thus has the potential to be susceptible to shellshock. The risk would come if an unprivileged user had influence over the name of a guest (eg a guest called “() { :;}; echo vulnerable”. Fortunately testing so far indicates that no exploit was possible in the context of libvirt-guests, even with maliciously crafted guest names, since most of the work is done based on UUIDs.

Until a few months ago the network filtering code in libvirt relied in automatically generated huge shell scripts to run iptables commands. We’ve not analysed this obsolete code to see if it was vulnerable to shellshock attacks, but it is believed to be unlikely since the only user controlled strings were iptables chain names and libvirt had strict validation on their content. In any case current libvirt versions have removed all this usage of shell, instead talking to firewalld via the DBus APIs.

Of course it is possible that other external programs that libvirt talks to, in turn use shell & could be vulnerable, but at least the areas that libvirt is responsible appear to be safe from shellshock attacks.

A timeline of process spawning API features

With system/popen ruled out, the alternatives for spawning external programs are fork+exec or possibly posix_spawn. The posix_spawn API is actually fairly decent in terms of the flexibility it offers, but still has a fairly high cost of usage in terms of populating the arguments it needs to receive. Thus, over the years libvirt has evolved a higher level API around the fork+exec system calls that is now in universal use across the codebase. The history below gives a little background on how the APIs evolved to their current form and featureset.

  • Feb 2007 – qemudStartVMDaemon(). When the QEMU driver was first added to libvirt the qemudStartVMDaemon() function was introduced to simplify spawning of the QEMU binary via fork+exec. The notable things this wrapper did were to connect the child processes stdio to pipes and then explicitly close every other file descriptor. This avoided the risky usage of shell and prevents leak of file descriptors. The impl was not reusable at this point since it also contained the code for turning the QEMU configuration into an array of argv
  • Feb 2007 – qemudExec(). When support for running dnsmasq was added to the QEMU driver, qemudStartVMDaemon() was refactored to create the qemudExec() funtion. This was the first re-usable wrapper around fork+exec in libvirt. Given an array of argv parameters it would spawn the child process connecting its stdin to /dev/null and returning a pair of pipes for reading stdout & stderr. This is on a par with popen() in terms of ease of use, but far safer to use due tot he avoidance of shell and safe file descriptor handling. It would gain many more features in the changes that follow
  • Jul, 2007 – _virExec(). With the introduction of the OpenVZ driver, the qemudExec() function was pulled out of the QEMU driver files into a shared utility module. This was the first tentative step towards sharing large amounts of code between different virtualization drivers in libvirt. The functionality remains the same as with qemudExec()
  • Aug, 2008 – _virExec(). The signal handling race mentioned earlier was discovered. A signal handler registered in the parent was set to write to a pipe when it triggered, but _virExec had closed the pipe file descriptor and the FD number happened to have been reused when setting up the pipe to use for the child programs stdio. The fix was to block all signals before running fork() and unblock them after fork(). Before unblocking them though, the child process reset all signal handlers to the defaults.
  • Jan, 2008 – virRun(). The _virExec() API was designed towards long lived child programs, so it would return the child PID and expect the caller to run waitpid later. To simplify spawning of short-lived programs the virRun() AP I was introduced that simply runs _virExec() and then does an immediate waitpid, feeding back the exit status to the caller. This is on a par with system() in terms of ease of use, but far safer to use due to avoidance of shell and safe signal & file descriptor handling.
  • Aug, 2008 – _virExec(). The _virExec() API initially created a pair of pipes to allow the childs stdout/stderr to be fed back to the parent. It was found to be convenient to pass a pre-opened file descriptor instead of creating a new pipe. Thus the _virExec() API was enhanced to allow such usage.
  • Aug, 2008 – _virExec(). Initially all programs spawned would inherit all environment variables set in the libvirtd daemon. The _virExec() API was enhanced to allow a custom set of environments to be passed in, to replace the existing environment. If a custom environment was requested, execve() would be used, otherwise it would continue to use execvp(). The same change also introduced a flag to request that the child program be daemonized. When this is requested, the child will fork() again and the original child would exit. The child process would also have its working directory changed to “/” and become a session leader.
  • Aug, 2008 – _virExec(). As mentioned previously, all file descriptors in the child would be closed and new set of FDs attached to stdin/out/err. To have greater control the _virExec() API was enhanced to allow the caller to specify a set of file descriptors to keep open.
  • Feb, 2009 – _virExec(). When introducing support for sVirt / SELinux to the QEMU driver there was a need to perform some special actions in between the fork() and exec() calls. Rather than hard code this setup for every caller of _virExec(), the ability to provide a pre-exec callback was introduced. This callback would be run immediately before exec() and would be used to set the SELinux process label for QEMU.
  • May, 2009 – _virExec(). Many child programs are able to write out a pidfile, which is particularly useful when daemonizing them, as there is no easy way to identify the second level child PID otherwise. This is a little racy for the parent process though, because there is no guarantee that the pidfile exists by the time _virExec() returns control to the parent. Thus _virExec() was enhanced to allow it to directly create pidfiles when daemonizing a command. The parent is thus guaranteed that the pidfile exists when _virExec() returns.
  • Jun, 2009 – _virExec(). When a privileged process is spawned it will generally inherit all capabilities of the parent process. If it is known that a program won’t require any capabilities even when running as root, then there can be a benefit in removing them. The virExec() API was thus extended to use libcap-ng to optionally clear capabilities of child procceses.
  • Feb, 2010 – _virExec(). Synchronize with internal logging mutex. If a thread was in the middle of a logging call while another thread used virExec() the logging mutex would still be held in the child process. Any attempt to log in the child process would thus deadlock. To deal with this, the logging mutex is aquired before forking the new process and released afterwards. This kind of dance must be done anywhere there is a global mutex that needs to be safely accessed in between a fork+exec pair.
  • Feb, 2010 – virFork(). There were a couple of cases where libvirt needs the ability to fork without exec’ing a new binary. The code for handling fork() and resetting of signal handlers was split out of virExec() to allow to be used independently.
  • May, 2010 – virCommand. The list of parameters to the virExec() API had grown larger than desired, so a new object, virCommand, is introduced. The idea is that an object is populated with all the information related to the command to be run and then an API is invoked to execute it. With virExec the caller was still responsible for allocating the char **argv, but with virCommand there are now helpers to greatly simplify the argv creation and make it less error prone.
  • Nov, 2010 – virCommand. Ordinarily, once a child process has forked it will run asynchronously from the parent process. There are some scenarios in which it is necessary to have a lock-step synchronization between the parent and child process. For example, libvirt’s disk locking needs to acquire leases on storage before the QEMU binary is exec’d but it needs to know the child PID too & the lock acquisition code cannot run in the child PID. The virCommand API is thus extended to introduce a handshake capability. Before exec’ing the new binary the child will send a notification to the parent process and then wait for a response before continuing.
  • Jan, 2012 – virCommand. The virCommand API will either allow the child to inherit all capabilities or block all capabilities. There are some cases where finer control is required, such as spawning LXC containers with limited privileges. The virCommand API was thus extended to allow specific capbilities to be whitelisted in the child.
  • Jan, 2013 – virCommand. It is not uncommon to want to change the UID/GID of a process that is spawned, particularly if it cannot be trusted to change on its own accord, or if it will be unable to change due to filtering of the capabilities to remove the CAP_SETUID bit. The virCommand API is extended to allow an alternate UID+GID to be specified for the command to launch. This change will be done inbetween the fork+exec at the same time as modifying the process capabilities.
  • May, 2013 – virCommand.When spawning QEMU it is necessary to adjust some of the process limits, for example, to raise the maximum file handle count so it is independent of any limit applied to libvirtd. Once again the virCommand APIs are extended so that a number of process limits can be changed in between the fork+exec pair.
  • Mar, 2014 – virCommand. When unit testing code it is desirable to avoid interacting with broader system state, so it is hard to test code which spawns external commands to do work. With the virCommand APIs though it was easy to introduce a dry run mode where the test suite can supply a callback to be invoked instead of the actual command. The callback can send back fake data for stdout/stderr as required to test the calling code.
  • Sep, 2014 – virCommand. Under UNIX, a child process will inherit its parent’s umask by default which is not always desirable. For example although libvirtd may have a umask for 0077, it is desirable for QEMU to get a umask for 0007, so that group shared resources can be set up. The virCommand APIs grew a new option for specifying a umask to set in between the fork+exec stage.

The timeline above shows that libvirt’s own APIs for spawning child processes have grown a huge number of features over the last 7 years of development. They very quickly surpassed system() and popen() in terms of safety from various serious problems while maintaining their ease of use. They have achieved this without having to expose the codebase as a whole to the low level complexities of fork()+exec(). The low level details are isolated in one central place, with the rest of the code using higher level APIs. The next blog post will illustrate just how libvirt’s virCommand APIs are used in practice.