Static analysis to validate mutex locking in libvirt using OCaml + CIL

Posted: May 15th, 2009 | Author: | Filed under: Coding Tips, libvirt, Virt Tools | 1 Comment »

For a long time the libvirtd daemon was single threaded, using a poll() loop to handle and dispatch all I/O requests in a single thread. This was reasonable when each operation was guaranteed to be a fast, but as libvirt has grown in scope, so has the complexity and execution time of many requests. Ultimately we had no choice but to make all the internal drivers thread-safe, and the libvirtd daemon itself multi-threaded. Each libvirt virConnectPtr connection can now be safely used from multiple threads at once. The libvirtd daemon will also process RPC calls from multiple clients concurrently, and even parallelize requests from a single client. Achieving this obviously required that we make significant use of pthread mutexes for locking data structures can be accessed concurrently.

To try and keep the code within the realm of human understanding, we came up with a fixed 2-level locking model. At the first level, each libvirt driver has a lock protecting its global shared state. At the second level, individual managed objects each have a lock (eg, each instance of virDomainObj, virNetworkObj, virStoragePoolObj has a lock). The rules for locking, can be stated pretty concisely

  1. The driver lock must be held when accessing any driver shared state
  2. The managed object lock must be held when accessing any managed object state
  3. The driver lock must be held before obtaining a lock on a managed object
  4. The driver and managed object locks may be released in any order
  5. The locks may not be recursively obtained

With these rules there are three normal locking sequences for each public API entry pointer implemented by a driver. The first, simplest, pattern is for a API that only accesses driver state

lock(driver)
....
....do something with 'driver'.....
....
unlock(driver)

The next pattern is for an API which has work with a managed object. These can usually release the driver lock once the managed object has been locked

lock(driver)
lock(object)
unlock(driver)
....
.... do something with 'object' ....
....
unlock(object)

The final pattern is for an API which has to work with both the driver and managed object state. In this case both locks must be held for the duration of the function

lock(driver)
lock(object)
....
.... do something with 'object' and 'driver'....
....
unlock(object)
unlock(driver)

For the 0.6.0 release I updated several hundred methods in libvirt to follow these 3 locking patterns. Inevitably there would be mistakes along the way, and those reviewing the patches did indeed find some. End users also found some more when we released this. And there is a continual risk of causing accidental regressions. One approach to validating the locking rules is via functional testing, but libvirt is a very complicated and expansive codebase, making it incredibly hard to exercise all possible codepaths in a functional test.

About a year ago, Richard Jones (of OCaml Tutorial fame) did an proof of concept using CIL to analyse libvirt. His examples loaded all libvirt code, generated control flow graphs and reported on exported functions. CIL is an enormously powerful toolset able to parse even very complicated C programs and perform complex static analysis checks. It sounded like just the tool for checking libvirt locking rules, as well as being a good excuse to learn OCaml

Functional languages really force you to think about the concepts you wish to express & check up front, before hacking code. This is possibly what discourages alot of people from using functional languages ;-) So after trying and failing to hack something up quickly, I stopped and considered what I needed to represent. The libvirt internal code has a pretty consistent pattern across different drivers and managed objects. So we can generalize some concepts

  1. Driver data types (eg, struct qemu_driver, xenUnifiedPrivatePtr)
  2. Managed object data types (eg, virDomainObjPtr, virNetworkObjPtr)
  3. A set of methods which lock drivers (eg, qemuDriverLock, xenUnifiedLock)
  4. A set of methods which unlock drivers (eg qemuDriverUnlock, xenUnifiedUnlock)
  5. A set of methods which obtain a locked managed object (eg, virDomainObjFindByName, virDomainObjAssignDef, virNetworkObjectFindByUUID)
  6. A set of methods which unlock managed objects (virDomainObjUnlock)
  7. A set of public API entry points for each driver (the function pointers in each virDriver struct instance)

I won’t go into fine details about CIL, because it is quite a complex beast. Suffice to say, it loads C code, parses it, and builds up an in-memory data structure representing the entire codebase. Todo this it needs the intermediate files from after pre-processing, eg, the ‘.i’ files generated by GCC when the ‘-save-temps’ flag is given. Once it has loaded the code into memory you can start to extract and process the interesting information.

The first step in libvirt analysis is to find all the public API entry points for each driver. Todo this, we search over all the global variables looking for instances of driver structs virDriver, virNetworkDriver, etc, and then processing their initializers to get all the function pointers. At this point we now have a list containing all the functions we wish to check

The next step is to iterate over each function, and run a data flow analysis. CIL provides two useful modules for data flow analysis. One of them runs analysis from the entry point, forwards, to each return point. The other runs analysis from each return point, backwards to the entry point. I can’t really say which module is “best”, but for libvirt locking analysis it felt like a forward data flow analysis would work pretty well. To perform analysis, you provide an implementation of the Cil.DataFlow.ForwardsTransfer interface which has a couple of methods, of which the interesting ones are doStmt and doInstr. The first is invoked once for each C code block, the second is invoked once for each C statement. When invoked they are passed some initial state. The interface implementation looks at the code block or statement in question, and decides what state has changed, and returns the new state. For lock analysis, I decided to store the following state about variables:

  1. The set of driver variables that are currently locked
  2. The set of driver variables that are currently unlocked
  3. The set of object variables that are currently locked
  4. The set of object variables that are currently unlocked

Thus, the first part of the implementation of the doInstr method was simply looking for function calls which lock or unlock a managed object or driver.
From this core state, some additional state can be derived at each C level statement, recording locking mistakes. The mistakes currently identified are

  1. Statements using an unlocked driver variable
  2. Statements using an unlocked object variable
  3. Statement locking a object variable while not holding a locked driver variable
  4. Statements locking a driver variable while holding a locked object variable
  5. Statements causing deadlock by fetching a lock object, while an object is already locked

The lock checker runs a forwards code flow analysis, constructing this state for each interesting function we identified earlier. When complete, it simply looks at the final state accumulated for each function. For each of the 5 types of mistake, it prints out the function name, the line number and the C code causing the mistake. In addition to those 5 mistakes, it also checks the final list of locked variables, and reports on any functions which forget to unlock a managed object, or driver, in any of their exit points.

In summary, we have been able to automatically detect and report on 7 critical locking mistakes across the entire driver codebase without needing to ever run the code. We also have a data flow analysis framework that can be further extended to detect other interesting locking problems. For example, it would be desirable to check whether one public API function calls into another, since this would cause a deadlock with non-recursive mutexes.

The final reports generated from the lock checking tool are designed to make it easily for the person reading them to figure out what locking rule has been violated. An example report from current libvirt CVS codebase looks like this

================================================================
Function: umlDomainStart
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 1564 "uml_driver.c"
ret = umlStartVMDaemon(dom->conn, driver___0, vm);
================================================================


================================================================
Function: umlDomainGetAutostart
----------------------------------------------------------------
  - Total exit points with locked vars: 1
  - At exit on #line 1675
return (ret);
^^^^^^^^^
    variables still locked are
      | struct uml_driver * driver___0
  - Total blocks with lock ordering mistakes: 0
================================================================


================================================================
Function: umlDomainSetAutostart
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 4
  - Driver used while unlocked on #line 1704
configFile = virDomainConfigFile(dom->conn,
                                 (char const   *)driver___0->configDir,
                                 (char const   *)(vm->def)->name);
  - Driver used while unlocked on #line 1706
autostartLink = virDomainConfigFile(dom->conn,
                                    (char const   *)driver___0->autostartDir,
                                    (char const   *)(vm->def)->name);
  - Driver used while unlocked on #line 1712
err = virFileMakePath((char const   *)driver___0->autostartDir);
  - Driver used while unlocked on #line 1713
virReportSystemErrorFull(dom->conn, 21, err, "uml_driver.c",
                         "umlDomainSetAutostart", 1715U,
                         (char const   *)tmp___1, driver___0->autostartDir);
================================================================


================================================================
Function: testOpen
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 3
  - Driver used while unlocked on #line 663 "test.c"
tmp___20 = virAlloc((void *)(& privconn->domainEventCallbacks),
                    sizeof(*(privconn->domainEventCallbacks)));
  - Driver used while unlocked on #line 663
privconn->domainEventQueue = virDomainEventQueueNew();
  - Driver used while unlocked on #line 670
privconn->domainEventTimer = virEventAddTimeout(-1, & testDomainEventFlush,
                                                (void *)privconn,
                                                (void (*)(void *opaque ))((void *)0));
================================================================


================================================================
Function: qemudClose
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 1815 "qemu_driver.c"
virDomainEventCallbackListRemoveConn(conn, driver___0->domainEventCallbacks);
================================================================


================================================================
Function: qemudDomainSuspend
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 2259
tmp___4 = qemudSaveDomainStatus(dom->conn, driver___0, vm);
================================================================


================================================================
Function: qemudDomainResume
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 2312
tmp___4 = qemudSaveDomainStatus(dom->conn, driver___0, vm);
================================================================


================================================================
Function: qemudDomainGetSecurityLabel
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 3189
tmp___2 = (*((driver___0->securityDriver)->domainGetSecurityLabel))(dom->conn,
                                                                    vm, seclabel);
================================================================


================================================================
Function: qemudDomainStart
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 3630
ret = qemudStartVMDaemon(dom->conn, driver___0, vm, (char const   *)((void *)0),
                         -1);
================================================================


================================================================
Function: qemudDomainAttachDevice
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 4192
tmp___8 = qemudSaveDomainStatus(dom->conn, driver___0, vm);
================================================================


================================================================
Function: qemudDomainDetachDevice
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 4316
tmp___3 = qemudSaveDomainStatus(dom->conn, driver___0, vm);
================================================================


================================================================
Function: qemudDomainSetAutostart
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 4
  - Driver used while unlocked on #line 4381
configFile = virDomainConfigFile(dom->conn,
                                 (char const   *)driver___0->configDir,
                                 (char const   *)(vm->def)->name);
  - Driver used while unlocked on #line 4383
autostartLink = virDomainConfigFile(dom->conn,
                                    (char const   *)driver___0->autostartDir,
                                    (char const   *)(vm->def)->name);
  - Driver used while unlocked on #line 4389
err = virFileMakePath((char const   *)driver___0->autostartDir);
  - Driver used while unlocked on #line 4390
virReportSystemErrorFull(dom->conn, 10, err, "qemu_driver.c",
                         "qemudDomainSetAutostart", 4392U,
                         (char const   *)tmp___1, driver___0->autostartDir);
================================================================


================================================================
Function: qemudDomainMigratePrepare2
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Object fetched while locked objects exist #line 4990
vm = virDomainAssignDef(dconn, & driver___0->domains, def);
================================================================


================================================================
Function: storagePoolRefresh
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 827 "storage_driver.c"
virStoragePoolObjRemove(& driver___0->pools, pool);
================================================================


================================================================
Function: storagePoolSetAutostart
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 2
  - Driver used while unlocked on #line 962
err = virFileMakePath((char const   *)driver___0->autostartDir);
  - Driver used while unlocked on #line 963
virReportSystemErrorFull(obj->conn, 18, err, "storage_driver.c",
                         "storagePoolSetAutostart", 965U,
                         (char const   *)tmp___1, driver___0->autostartDir);
================================================================


================================================================
Function: storageVolumeCreateXML
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Object locked while driver is unlocked on #line 1277
virStoragePoolObjLock(pool);
================================================================


================================================================
Function: networkStart
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 1228 "network_driver.c"
ret = networkStartNetworkDaemon(net->conn, driver___0, network);
================================================================


================================================================
Function: networkDestroy
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 1
  - Driver used while unlocked on #line 1251
ret = networkShutdownNetworkDaemon(net->conn, driver___0, network);
================================================================


================================================================
Function: networkSetAutostart
----------------------------------------------------------------
  - Total exit points with locked vars: 0
  - Total blocks with lock ordering mistakes: 4
  - Driver used while unlocked on #line 1363
configFile = virNetworkConfigFile(net->conn,
                                  (char const   *)driver___0->networkConfigDir,
                                  (char const   *)(network->def)->name);
  - Driver used while unlocked on #line 1365
autostartLink = virNetworkConfigFile(net->conn,
                                     (char const   *)driver___0->networkAutostartDir,
                                     (char const   *)(network->def)->name);
  - Driver used while unlocked on #line 1371
err = virFileMakePath((char const   *)driver___0->networkAutostartDir);
  - Driver used while unlocked on #line 1372
virReportSystemErrorFull(net->conn, 19, *tmp___1, "network_driver.c",
                         "networkSetAutostart", 1374U, (char const   *)tmp___0,
                         driver___0->networkAutostartDir);
================================================================

In summary, CIL is an incredibly powerful tool, and it is well worth learning OCaml in order to use the CIL framework for doing static analysis of C code. The nummber of bugs it has allowed us to identify and fix more than compensates for effort developing this test harness.

One Comment

gabriel said at 1:06 pm on May 16th, 2009:

Very interresting. But some of your code is needlessly verbose. For example:

let isLockableThingNull exp funcheck =
match exp with
| UnOp (op,exp,typ) -> (
match op with
LNot -> (
match exp with
Lval (lhost, off) -> (
match lhost with
Var vi ->
funcheck vi
| _ -> false
)
| _ -> false
)
| _ -> false
)
| _ ->
false

might be reduced to:

let isLockableThingNull exp funcheck =
match exp with
| UnOp (LNot,Lval (Var vi,off),typ) -> funcheck vi
| _ -> false

Happy hacking!

Leave a Reply





Spam protection: Sum of tw0 plus t3n ?: