Announce: libvirt-sandbox “Dashti Margo” 0.6.0 release – an application sandbox toolkit

Posted: July 1st, 2015 | Filed under: Fedora, libvirt, Security, Virt Tools | Tags: , , , , , | 2 Comments »

I pleased to announce the a new public release of libvirt-sandbox, version 0.6.0, is now available from:

http://sandbox.libvirt.org/download/

The packages are GPG signed with

  Key fingerprint: DAF3 A6FD B26B 6291 2D0E  8E3F BE86 EBB4 1510 4FDF (4096R)

The libvirt-sandbox package provides an API layer on top of libvirt-gobject which facilitates the cration of application sandboxes using virtualization technology. An application sandbox is a virtual machine or container that runs a single application binary, directly from the host OS filesystem. In other words there is no separate guest operating system install to build or manage.

At this point in time libvirt-sandbox can create sandboxes using either LXC or KVM, and should in theory be extendable to any libvirt driver.

This release contains a mixture of new features and bugfixes.

The first major feature is the ability to provide block devices to sandboxes. Most of the time sandboxes only want/need filesystems, but there are some use cases where block devices are useful. For example, some applications (like databases) can directly use raw block devices for storage. Another one is where a tool actually wishes to be able to format filesystems and have this done inside the container. The complexity with exposing block devices is giving the sandbox tools a predictable path for accessing the device which does not change across hypervisors. To solve this, instead of allowing users of virt-sandbox to specify a block device name, they provide an opaque tag name. The block device is then made available at a path /dev/disk/by-tag/TAGNAME, which symlinks back to whatever hypervisor specific disk name was used.

The second major feature is the ability to provide a custom root filesystem for the sandbox. The original intent of the sandbox tool was that it provide an easy way to confine and execute applications that are installed on the host filesystem, so by default the host / filesystem is mapped to the sandbox / filesystem read-only. There are some use cases, however, where the user may wish to have a completely different root filesystem. For example, they may wish to execute applications from some separate disk image. So virt-sandbox now allows the user to map in a different root filesystem for the sandbox.

Both of these features were developed as part of a Google Summer of Code 2015 project which is aiming to enhance libvirt sandbox so that it is capable of executing images distributed by the Docker container image repository service. The motivation for this goes back to the original reason for creating the libvirt-sandbox project in the first place, which was to provide a hypervisor agnostic framework for sandboxing applications, as a higher level above the libvirt API. Once this is work is complete it’ll be possible to launch Docker images via libvirt QEMU, KVM or LXC, with no need for the Docker toolchain itself.

The detailed list of changes in this release is:

  • API/ABI in-compatible change, soname increased
  • Prevent use of virt-sandbox-service as non-root upfront
  • Fix misc memory leaks
  • Block SIGHUP from the dhclient binary to prevent accidental death if the controlling terminal is closed & reopened
  • Add support for re-creating libvirt XML from sandbox config to facilitate upgrades
  • Switch to standard gobject introspection autoconf macros
  • Add ability to set filters on network interfaces
  • Search /usr/lib instead of /lib for systemd unit files, as the former is the canonical location even when / and /usr are merged
  • Only set SELinux labels on hosts that support SELinux
  • Explicitly link to selinux, instead of relying on indirect linkage
  • Update compiler warning flags
  • Fix misc docs comments
  • Don’t assume use of SELinux in virt-sandbox-service
  • Fix path checks for SUSE in virt-sandbox-service
  • Add support for AppArmour profiles
  • Mount /var after other FS to ensure host image is available
  • Ensure state/config dirs can be accessed when QEMU is running non-root for qemu:///system
  • Fix mounting of host images in QEMU sandboxes
  • Mount images as ext4 instead of ext3
  • Allow use of non-raw disk images as filesystem mounts
  • Check if required static libs are available at configure time to prevent silent fallback to shared linking
  • Require libvirt-glib >= 0.2.1
  • Add support for loading lzma and gzip compressed kmods
  • Check for support libvirt URIs when starting guests to ensure clear error message upfront
  • Add LIBVIRT_SANDBOX_INIT_DEBUG env variable to allow debugging of kernel boot messages and sandbox init process setup
  • Add support for exposing block devices to sandboxes with a predictable name under /dev/disk/by-tag/TAGNAME
  • Use devtmpfs instead of tmpfs for auto-populating /dev in QEMU sandboxes
  • Allow setup of sandbox with custom root filesystem instead of inheriting from host’s root.
  • Allow execution of apps from non-matched ld-linux.so / libc.so, eg executing F19 binaries on F22 host
  • Use passthrough mode for all QEMU filesystems

Announce: libvirt-sandbox “Cholistan” 0.5.1 release – an application sandbox toolkit

Posted: November 19th, 2013 | Filed under: Fedora, libvirt, Security, Virt Tools | Tags: , , , , , | No Comments »

I pleased to announce the a new public release of libvirt-sandbox, version 0.5.1, is now available from:

http://sandbox.libvirt.org/download/

The packages are GPG signed with

  Key fingerprint: DAF3 A6FD B26B 6291 2D0E  8E3F BE86 EBB4 1510 4FDF (4096R)

The libvirt-sandbox package provides an API layer on top of libvirt-gobject which facilitates the cration of application sandboxes using virtualization technology. An application sandbox is a virtual machine or container that runs a single application binary, directly from the host OS filesystem. In other words there is no separate guest operating system install to build or manage.

At this point in time libvirt-sandbox can create sandboxes using either LXC or KVM, and should in theory be extendable to any libvirt driver.

This release focused on exclusively on bugfixing

Changed in this release:

  • Fix path to systemd binary (prefers dir /lib/systemd not /bin)
  • Remove obsolete commands from virt-sandbox-service man page
  • Fix delete of running service container
  • Allow use of custom root dirs with ‘virt-sandbox –root DIR’
  • Fix ‘upgrade’ command for virt-sandbox-service generic services
  • Fix logrotate script to use virsh for listing sandboxed services
  • Add ‘inherit’ option for virt-sandbox ‘-s’ security context option, to auto-copy calling process’ context
  • Remove non-existant ‘-S’ option froom virt-sandbox-service man page
  • Fix line break formatting of man page
  • Mention LIBVIRT_DEFAULT_URI in virt-sandbox-service man page
  • Check some return values in libvirt-sandbox-init-qemu
  • Remove unused variables
  • Fix crash with partially specified mount option string
  • Add man page docs for ‘ram’ mount type
  • Avoid close of un-opened file descriptor
  • Fix leak of file handles in init helpers
  • Log a message if sandbox cleanup fails
  • Cope with domain being missing when deleting container
  • Improve stack trace diagnostics in virt-sandbox-service
  • Fix virt-sandbox-service content copying code when faced with non-regular files.
  • Improve error reporting if kernel does not exist
  • Allow kernel version/path/kmod to be set with virt-sandbox
  • Don’t overmount ‘/root’ in QEMU sandboxes by default
  • Fix nosuid / nodev mount options for tmpfs
  • Force 9p2000.u protocol version to avoid QEMU bugs
  • Fix cleanup when failing to start interactive sandbox
  • Create copy of kernel from /boot to allow relabelling
  • Bulk re-indent of code
  • Avoid crash when gateway is missing in network options
  • Fix symlink target created in multi-user.target.wants
  • Add ‘-p PATH’ option for virt-sandbox-service clone/delete to match ‘create’ command option.
  • Only allow ‘lxc:///’ URIs with virt-sandbox-service until further notice
  • Rollback state if cloning a service sandbox fails
  • Add more kernel modules instead of assuming they are all builtins
  • Don’t complain if some kmods are missing, as they may be builtins
  • Allow –mount to be repeated with virt-sandbox-service

Thanks to everyone who contributed to this release

Running a full Fedora OS inside a libvirt LXC guest

Posted: August 12th, 2013 | Filed under: libvirt, Virt Tools | Tags: , , , | 2 Comments »

Historically, running a Linux OS inside an LXC guest, has required a execution of a set of hacky scripts which do a bunch of customizations to the default OS install to make it work in the constrained container environment. One of the many benefits to Fedora, of the switch over to systemd, has been that a default Fedora install has much more sensible behaviour when run inside containers. For example, systemd will skip running udev inside a container since containers do not get given permission to mknod – the /dev is pre-populated with the whitelist of devices the container is allowed to use. As such running Fedora inside a container is really not much more complicated than invoking yum to install desired packages into a chroot, then invoking virt-install to configure the LXC guest.

As a proof of concept, on Fedora 19 I only needed to do the following to setup a Fedora 19 environment suitable for execution inside LXC

 # yum -y --releasever=19 --nogpg --installroot=/var/lib/libvirt/filesystems/mycontainer \
          --disablerepo='*' --enablerepo=fedora install \
          systemd passwd yum fedora-release vim-minimal openssh-server procps-ng
 # echo "pts/0" >> /var/lib/libvirt/filesystems/mycontainer/etc/securetty
 # chroot /var/lib/libvirt/filesystems/mycontainer /bin/passwd root

It would be desirable to avoid the manual editing of /etc/securetty. LXC guests get their default virtual consoles backed by a /dev/pts/0 device, which isn’t listed in the securetty file by default. Perhaps it is a simple as just adding that device node unconditionally. Just have to think about whether there’s a reason to not do that which would impact bare metal. With the virtual root environment ready, now virt-install can be used to configure the container with libvirt

# virt-install --connect lxc:/// --name mycontainer --ram 800 \
              --filesystem /var/lib/libvirt/filesystems/mycontainer,/

virt-install will create the XML config libvirt wants, and boot the guest, opening a connection to the primary text console. This should display boot up messages from the instance of systemd running as the container’s init process, and present a normal text login prompt.

If attempting this with systemd-nspawn command, login would fail because the PAM modules audit code will reject all login attempts. This is really unhelpful behaviour by PAM modules which can’t be disabled by any config, except for booting the entire host with audit=0 which is not very desirable. Fortunately, however, virt-install will configure a separate network namespace for the container by default, which will prevent the PAM module from talking to the kernel audit service entirely, giving it a ECONNREFUSED error. By a stoke of good luck, the PAM modules treat ECONNREFUSED as being equivalent to booting with audit=0, so everything “just works”. This is nice case of two bugs cancelling out to leave no bug :-)

While the above commands are fairly straightforward, it is a goal of ours to simplify live even further, into a single command. We would like to provide a command that looks something like this:

# virt-bootstrap --connect lxc:/// --name mycontainer --ram 800 \
                 --root /var/lib/libvirt/filesystems/mycontainer \
                 --osid fedora19

The idea is that the ‘–osid’ value will be looked up in the libosinfo database. This will have details of the software repository for that OS, and whether it uses yum/apt/ebuild/somethingelse. virt-bootstrap will then invoke the appropriate packaging tool to populate the root filesystem, and then boot the container in one single step.

One final point is that LXC in Fedora still can’t really be considered to be secure without the use of SELinux. The commands I describe above don’t do anything to enable SELinux protection of the container at this time. This is obviously something that ought to be fixed. Separate from this, upstream libvirt now has support for the kernel user namespace feature. This enables plain old the DAC framework to provide a secure container environment. Unfortunately this kernel feature is still not available in Fedora kernel builds. It is blocked on upstream completion of patches for XFS. Fortunately this work seems to be moving forward again, so if we’re lucky it might just be possible to enable user namespaces in Fedora 20, finally making LXC reasonably secure by default even without SELinux.

KVM Forum: building application sandboxes on top of KVM or LXC using libvirt

Posted: November 8th, 2012 | Filed under: Fedora, libvirt, Virt Tools | Tags: , , , , | No Comments »

This week I have spent my time at LinuxCon Europe and KVM Forum 2012. I gave a talk titled “Building application sandboxes on top of KVM or LXC using libvirt”. For those who enquired afterwards, the slides are now available.

Building application sandboxes with libvirt, LXC & KVM

Posted: January 17th, 2012 | Filed under: Fedora, libvirt, Virt Tools | Tags: , , , , , | 12 Comments »

I have mentioned in passing every now & then over the past few months, that I have been working on a tool for creating application sandboxes using libvirt, LXC and KVM. Last Thursday, I finally got around to creating a first public release of a package that is now called libvirt-sandbox. Before continuing it is probably worth defining what I consider the term “application sandbox” to mean. My working definition is that an “application sandbox” is simply a way to confine the execution environment of an application, limiting the access it has to OS resources. To me one notable point is that there is no need for a separate / special installation of the application to be confined. An application sandbox ought to be able to run any existing application installed in the OS.

Background motivation & prototype

For a few Fedora releases, users have had the SELinux sandbox command which will execute a command with a strictly confined SELinux context applied. It is also able to make limited use of the kernel filesystem namespace feature, to allow changes to the mount table inside the sandbox. For example, the common case is to put in place a different $HOME. The SELinux sandbox has been quite effective, but there is a limit to what can be done with SELinux policy alone, as evidenced by the need to create a setuid helper to enable use of the kernel namespace feature. Architecturally this gets even more problematic as new feature requests need to be dealt with.

As most readers are no doubt aware, libvirt provides a virtualization management API, with support for a wide variety of virtualization technologies. The KVM driver is easily the most advanced and actively developed driver for libvirt with a very wide array of features for machine based virtualization. In terms of container based virtualization, the LXC driver is the most advanced driver in libvirt, often getting new features “for free” since it shares alot of code with the KVM driver, in particular anything cgroup based. The LXC driver has always had the ability to pass arbitrary host filesystems through to the container, and the KVM driver gained similar capabilities last year with the inclusion of support for virtio 9p filesystems. One of the well known security features in libvirt is sVirt, which leverages MAC technology like SELinux to strictly confine the execution environment of QEMU. This has also now been adapted to work for the LXC driver.

Looking at the architecture of the SELinux sandbox command last year, it occurred to me that the core concepts mapped very well to the host filesystem passthrough & sVirt features in libvirt’s KVM & LXC drivers. In other words, it ought to be possible to create application sandboxes using the libvirt API and suitably advanced drivers like KVM or LXC. A few weeks hacking resulted in a proof of concept tool virt-sandbox which can run simple commands in sandboxes built on LXC or KVM.

The libvirt-sandbox API

A command line tool for running applications inside a sandbox is great, but even more useful would be an API for creating application sandboxes that programmers can use directly. While libvirt provides an API that is portable across different virtualization technologies, it cannot magically hide the differences in feature set or architecture between the technologies. Thus the decision was taken to create a new library called libvirt-sandbox that provides a higher level API for managing application sandboxes, built on top of libvirt. The virt-sandbox command from the proof of concept would then be re-implemented using this library API.

The libvirt-sandbox library is built using GObject to enable it to be accessible to any programming language via GObject Introspection. The basic idea is that programmer simply defines the desired characteristics of the sandbox, such as the command to be executed, any arguments, filesystems to be exposed from host, any bind mounts, private networking configuration, etc. From this configuration description, libvirt-sandbox will decide upon & construct a libvirt guest XML configuration that can actually provided the requested characteristics. In other words, the libvirt-sandbox API is providing a layer of policy avoid libvirt, to isolate the application developer from the implementation details of the underlying hypervisor.

Building sandboxes using LXC is quite straightforward, since application confinement is a core competency of LXC. Thus I will move straight to the KVM implementation, which is where the real fun is. Booting up an entire virtual machine probably sounds like quite a slow process, but it really need not be particularly if you have a well constrained hardware definition which avoids any need for probing. People also generally assume that running a KVM guest, means having a guest operating system install. This is absolutely something that is not acceptable for application sandboxing, and indeed not actually necessary. In a nutshell, libvirt-sandbox creates a new initrd image containing a custom init binary. This init binary simply loads the virtio-9p kernel module and then mounts the host OS’ root filesystem as the guest’s root filesystem, readonly of course. It then hands off to a second boot strap process which runs the desired application binary and forwards I/O back to the host OS, until the sandboxed application exits. Finally the init process powers off the virtual machine. To get an idea of the overhead, the /bin/false binary can be executed inside a KVM sandbox with an overall execution time of 4 seconds. That is the total time for libvirt to start QEMU, QEMU to run its BIOS, the BIOS to load the kernel + initrd, the kenrel to boot up, /bin/false to run, and the kernel to shutdown & QEMU to exit. I think 3 seconds is pretty impressive todo all that. This is a constant overhead, so for a long running command like an MP3 encoder, it disappears into the background noise. With sufficient optimization, I’m fairly sure we could get the overhead down to approx 2 seconds.

Using the virt-sandbox command

The Fedora review of the libvirt-sandbox package was nice & straightforward, so the package is already available in rawhide for ready to test the VirtSandbox F17 feature. The virt-sandbox command is provided by the libvirt-sandbox RPM package

# yum install libvirt-sandbox

Assuming libvirt is already installed & able to run either LXC or KVM guests, everything is ready to use immediately.

A first example is to run the ‘/bin/date’ command inside a KVM sandbox:

$ virt-sandbox -c qemu:///session  /bin/date
Thu Jan 12 22:30:03 GMT 2012

You want proof that this really is running an entire KVM guest ? How about looking at the /proc/cpuinfo contents:

$ virt-sandbox -c qemu:///session /bin/cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 2
model name	: QEMU Virtual CPU version 1.0
stepping	: 3
cpu MHz		: 2793.084
cache size	: 4096 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 4
wp		: yes
flags		: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm up rep_good nopl pni cx16 hypervisor lahf_lm
bogomips	: 5586.16
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

How about using LXC instead of KVM, and providing an interactive console instead of just a one-shot command ? Yes, we can do that too:

$ virt-sandbox -c lxc:/// /bin/sh
sh-4.2$ ps -axuwf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0 165436  3756 pts/0    Ss+  22:31   0:00 libvirt-sandbox-init-lxc
berrange    24  0.0  0.1 167680  4688 pts/0    S+   22:31   0:00 libvirt-sandbox-init-common
berrange    47  0.0  0.0  13852  1608 pts/1    Ss   22:31   0:00  \_ /bin/sh
berrange    48  0.0  0.0  13124   996 pts/1    R+   22:31   0:00      \_ ps -axuwf

Notice how we only see the processes from our sandbox, none from the host OS. There are many more examples I’d like to illustrate, but this post is already far too long.

Future development

This blog post might give the impression that every is complete & operational, but that is far from the truth. This is only the bare minimum functionality to enable some real world usage.  Things that are yet to be dealt with include

  • Write suitable SELinux policy extensions to allow KVM to access host OS filesystems in readonly mode. Currently you need to run in permissive mode which is obviously something that needs solving before F17
  • Turn the virt-viewer command code for SPICE/VNC into a formal API and use that to provide a graphical sandbox running Xorg.
  • Integrate a tool that is able to automatically create sandbox instances for system services like apache to facilitate confined vhosting deployments
  • Correctly propagate exit status from the sandboxed command to the host OS
  • Unentangle stderr and stdout from the sandboxed command
  • Figure out how to make dhclient work nicely when / is readonly and resolv.conf must be updated in-place
  • Expose all the libvirt performance tuning controls to allow disk / net I/O controls, CPU scheduling, NUMA affinity, etc
  • Wire up libvirt’s firewall capability to allow detailed filtering of network traffic to/from sandboxes
  • Much more…

For those attending FOSDEM this year, I will be giving a presentation about libvirt-sandbox in the virt/cloud track.

Oh and as well as the released tar.gz mentioned in the first paragraph, or the Fedora RPM, the  code is all available in GIT