Running a full Fedora OS inside a libvirt LXC guest

Posted: August 12th, 2013 | Filed under: libvirt, Virt Tools

Historically, running a Linux OS inside an LXC guest required executing a set of hacky scripts which do a bunch of customizations to the default OS install to make it work in the constrained container environment. One of the many benefits to Fedora of the switch over to systemd has been that a default Fedora install behaves much more sensibly when run inside containers. For example, systemd will skip running udev inside a container, since containers are not given permission to mknod; instead /dev is pre-populated with the whitelist of devices the container is allowed to use. As such, running Fedora inside a container is really not much more complicated than invoking yum to install the desired packages into a chroot, then invoking virt-install to configure the LXC guest.

As a proof of concept, on Fedora 19 I only needed to do the following to set up a Fedora 19 environment suitable for execution inside LXC:

 # yum -y --releasever=19 --nogpg --installroot=/var/lib/libvirt/filesystems/mycontainer \
          --disablerepo='*' --enablerepo=fedora install \
          systemd passwd yum fedora-release vim-minimal openssh-server procps-ng
 # echo "pts/0" >> /var/lib/libvirt/filesystems/mycontainer/etc/securetty
 # chroot /var/lib/libvirt/filesystems/mycontainer /bin/passwd root

It would be desirable to avoid the manual editing of /etc/securetty. LXC guests get their default virtual console backed by a /dev/pts/0 device, which isn't listed in the securetty file by default. Perhaps it is as simple as just adding that device node unconditionally; we just have to think about whether there's a reason not to do that which would impact bare metal. With the virtual root environment ready, virt-install can now be used to configure the container with libvirt:

# virt-install --connect lxc:/// --name mycontainer --ram 800 \
              --filesystem /var/lib/libvirt/filesystems/mycontainer,/

virt-install will create the XML config libvirt wants and boot the guest, opening a connection to the primary text console. This should display boot-up messages from the instance of systemd running as the container's init process, and present a normal text login prompt.
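For reference, the domain XML that virt-install generates looks roughly like the following (a hand-written sketch; the exact element ordering and defaults may differ across virt-install versions):

  <domain type='lxc'>
    <name>mycontainer</name>
    <memory>819200</memory>
    <os>
      <type>exe</type>
      <init>/sbin/init</init>
    </os>
    <devices>
      <filesystem type='mount'>
        <source dir='/var/lib/libvirt/filesystems/mycontainer'/>
        <target dir='/'/>
      </filesystem>
      <interface type='network'>
        <source network='default'/>
      </interface>
      <console type='pty'/>
    </devices>
  </domain>

If you detach from the console (Ctrl + ]), virsh -c lxc:/// console mycontainer will reconnect to it later.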

If attempting this with the systemd-nspawn command, login would fail because the PAM modules' audit code will reject all login attempts. This is really unhelpful behaviour by the PAM modules which can't be disabled by any config, short of booting the entire host with audit=0, which is not very desirable. Fortunately, however, virt-install will configure a separate network namespace for the container by default, which prevents the PAM module from talking to the kernel audit service entirely, giving it an ECONNREFUSED error. By a stroke of good luck, the PAM modules treat ECONNREFUSED as being equivalent to booting with audit=0, so everything “just works”. This is a nice case of two bugs cancelling out to leave no bug :-)

While the above commands are fairly straightforward, it is a goal of ours to simplify life even further, into a single command. We would like to provide a command that looks something like this:

# virt-bootstrap --connect lxc:/// --name mycontainer --ram 800 \
                 --root /var/lib/libvirt/filesystems/mycontainer \
                 --osid fedora19

The idea is that the '--osid' value will be looked up in the libosinfo database. This will have details of the software repository for that OS, and whether it uses yum/apt/ebuild/somethingelse. virt-bootstrap will then invoke the appropriate packaging tool to populate the root filesystem, and then boot the container, all in one single step.
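Until such a tool exists, it is easy enough to approximate it by wrapping the two commands shown earlier in a small script. This is purely an illustrative sketch for the yum-based Fedora case (the name lxc-bootstrap is hypothetical, and this is not the planned virt-bootstrap implementation):

  #!/bin/sh
  # lxc-bootstrap: hypothetical helper wrapping the manual steps above
  # Usage: lxc-bootstrap <name> <fedora-releasever>
  NAME=$1
  RELEASE=$2
  ROOT=/var/lib/libvirt/filesystems/$NAME

  # Populate the root filesystem with a minimal package set
  yum -y --releasever=$RELEASE --nogpg --installroot=$ROOT \
      --disablerepo='*' --enablerepo=fedora install \
      systemd passwd yum fedora-release vim-minimal openssh-server procps-ng

  # Permit root logins on the container's /dev/pts/0 console
  echo "pts/0" >> $ROOT/etc/securetty
  chroot $ROOT /bin/passwd root

  # Define the LXC guest with libvirt and boot it
  virt-install --connect lxc:/// --name $NAME --ram 800 \
               --filesystem $ROOT,/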

One final point is that LXC in Fedora still can't really be considered secure without the use of SELinux. The commands I describe above don't do anything to enable SELinux protection of the container at this time. This is obviously something that ought to be fixed. Separate from this, upstream libvirt now has support for the kernel user namespace feature. This enables the plain old DAC framework to provide a secure container environment. Unfortunately this kernel feature is still not available in Fedora kernel builds. It is blocked on upstream completion of patches for XFS. Fortunately this work seems to be moving forward again, so if we're lucky it might just be possible to enable user namespaces in Fedora 20, finally making LXC reasonably secure by default even without SELinux.
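For the curious, the upstream user namespace support is enabled with a per-guest ID mapping in the domain XML. The following is just an illustrative sketch; the start/target/count values are arbitrary examples, mapping container IDs 0-9999 onto an unprivileged range of host IDs:

  <idmap>
    <uid start='0' target='100000' count='10000'/>
    <gid start='0' target='100000' count='10000'/>
  </idmap>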

A new (configurable) cgroups layout for libvirt with QEMU, KVM & LXC

Posted: May 13th, 2013 | Filed under: Fedora, libvirt, OpenStack, Virt Tools

Several years ago I wrote a bit about libvirt and cgroups in Fedora 12. Since that time, much has changed, and we've learnt a lot about the use of cgroups, not all of it good.

Perhaps the biggest change has been the arrival of systemd, which has brought cgroups to the attention of a much wider audience. One of the biggest positive impacts of systemd on cgroups has been a formalization of how to integrate with cgroups as an application developer. Libvirt of course follows these cgroups guidelines, has had input into their definition, and continues to work with the systemd community to improve them.

One of the things we’ve learnt the hard way is that the kernel implementation of control groups is not without cost, and the way applications use cgroups can have a direct impact on the performance of the system. The kernel developers have done a great deal of work to improve the performance and scalability of cgroups but there will always be a cost to their usage which application developers need to be aware of. In broad terms, the performance impact is related to the number of cgroups directories created and particularly to their depth.

To cut a long story short, it became clear that the directory hierarchy layout libvirt used with cgroups was seriously sub-optimal, or even outright harmful. Thus in libvirt 1.0.5, we introduced some radical changes to the layout created.

Historically libvirt would create a cgroup directory for each virtual machine or container, at a path $LOCATION-OF-LIBVIRTD/libvirt/$DRIVER-NAME/$VMNAME. For example, if libvirtd was placed in /system/libvirtd.service, then a QEMU guest named “web1” would live at /system/libvirtd.service/libvirt/qemu/web1. That’s 5 levels deep already, which is not good.

As of libvirt 1.0.5, libvirt will create a cgroup directory for each virtual machine or container at a path /machine/$VMNAME.libvirt-$DRIVER-NAME. First notice how this is now completely disassociated from the location of libvirtd itself. This allows the administrator greater flexibility in controlling resources for virtual machines independently of system services. Second notice that the directory hierarchy is only 2 levels deep by default, so a QEMU guest named “web1” would live at /machine/web1.libvirt-qemu.
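The new layout is easy to see on a running host. Assuming the controllers are mounted in the usual systemd locations under /sys/fs/cgroup, something like the following (output is illustrative):

 # ls -d /sys/fs/cgroup/cpu,cpuacct/machine/*.libvirt-qemu
 /sys/fs/cgroup/cpu,cpuacct/machine/web1.libvirt-qemu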

The final important change is that the location of a virtual machine / container can now be configured on a per-guest basis in the XML configuration, to override the default of /machine. So if the guest config says

  <resource>
    <partition>/virtualmachines/production</partition>
  </resource>

then libvirt will create the guest cgroup directory at /virtualmachines.partition/production.partition/web1.libvirt-qemu. Notice that there will always be a .partition suffix on these user-defined directories. Only the default top level directories /machine, /system and /user will be without a suffix. The suffix ensures that user-defined directories can never clash with anything the kernel will create. The systemd PaxControlGroups document will be updated with this & a few escaping rules soon.
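Note that, as mentioned below, there is not yet a libvirt API for creating these partitions, so the directories must currently be created by hand in each controller mount before the guest starts. A rough illustration, assuming the usual mount points under /sys/fs/cgroup:

 # for c in cpu,cpuacct memory blkio freezer; do
 >    mkdir -p /sys/fs/cgroup/$c/virtualmachines.partition/production.partition
 > done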

There is still more we intend to do with cgroups in libvirt, in particular adding APIs for creating & managing these partitions for grouping VMs, so you don't need to go to a tool outside libvirt to create the directories.

One final thing: libvirt now has a bit of documentation about its cgroups usage, which will serve as the base for future documentation in this area.