The surprisingly complicated world of disk image sizes

Posted: February 10th, 2017 | Author: | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Virt Tools | Tags: , , | No Comments »

When managing virtual machines one of the key tasks is to understand the utilization of resources being consumed, whether RAM, CPU, network or storage. This post will examine different aspects of managing storage when using file based disk images, as opposed to block storage. When provisioning a virtual machine the tenant user will have an idea of the amount of storage they wish the guest operating system to see for their virtual disks. This is the easy part. It is simply a matter of telling ‘qemu-img’ (or a similar tool) ’40GB’ and it will create a virtual disk image that is visible to the guest OS as a 40GB volume. The virtualization host administrator, however, doesn’t particularly care about what size the guest OS sees. They are instead interested in how much space is (or will be) consumed in the host filesystem storing the image. With this in mind, there are four key figures to consider when managing storage:

  • Capacity – the size that is visible to the guest OS
  • Length – the current highest byte offset in the file.
  • Allocation – the amount of storage that is currently consumed.
  • Commitment – the amount of storage that could be consumed in the future.

The relationship between these figures will vary according to the format of the disk image file being used. For the sake of illustration, raw and qcow2 files will be compared since they provide an examples of the simplest file format and the most complicated file format used for virtual machines.

Raw files

In a raw file, the sectors visible to the guest are mapped 1-2-1 onto sectors in the host file. Thus the capacity and length values will always be identical for raw files – the length dictates the capacity and vica-verca. The allocation value is slightly more complicated. Most filesystems do lazy allocation on blocks, so even if a file is 10 GB in length it is entirely possible for it to consume 0 bytes of physical storage, if nothing has been written to the file yet. Such a file is known as “sparse” or is said to have “holes” in its allocation. To maximize guest performance, it is common to tell the operating system to fully allocate a file at time of creation, either by writing zeros to every block (very slow) or via a special system call to instruct it to immediately allocate all blocks (very fast). So immediately after creating a new raw file, the allocation would typically either match the length, or be zero. In the latter case, as the guest writes to various disk sectors, the allocation of the raw file will grow. The commitment value refers the upper bound for the allocation value, and for raw files, this will match the length of the file.

While raw files look reasonably straightforward, some filesystems can create surprises. XFS has a concept of “speculative preallocation” where it may allocate more blocks than are actually needed to satisfy the current I/O operation. This is useful for files which are progressively growing, since it is faster to allocate 10 blocks all at once, than to allocate 10 blocks individually. So while a raw file’s allocation will usually never exceed the length, if XFS has speculatively preallocated extra blocks, it is possible for the allocation to exceed the length. The excess is usually pretty small though – bytes or KBs, not MBs. Btrfs meanwhile has a concept of “copy on write” whereby multiple files can initially share allocated blocks and when one file is written, it will take a private copy of the blocks written. IOW, to determine the usage of a set of files it is not sufficient sum the allocation for each file as that would over-count the true allocation due to block sharing.

QCow2 files

In a qcow2 file, the sectors visible to the guest are indirectly mapped to sectors in the host file via a number of lookup tables. A sector at offset 4096 in the guest, may be stored at offset 65536 in the host. In order to perform this mapping, there are various auxiliary data structures stored in the qcow2 file. Describing all of these structures is beyond the scope of this, read the specification instead. The key point is that, unlike raw files, the length of the file in the host has no relation to the capacity seen in the guest. The capacity is determined by a value stored in the file header metadata. By default, the qcow2 file will grow on demand, so the length of the file will gradually grow as more data is stored. It is possible to request preallocation, either just of file metadata, or of the full file payload too. Since the file grows on demand as data is written, traditionally it would never have any holes in it, so the allocation would always match the length (the previous caveat wrt to XFS speculative preallocation still applies though). Since the introduction of SSDs, however, the notion of explicitly cutting holes in files has become commonplace. When this is plumbed through from the guest, a guest initiated TRIM request, will in turn create a hole in the qcow2 file, which will also issue a TRIM to the underlying host storage. Thus even though qcow2 files are grow on demand, they may also become sparse over time, thus allocation may be less than the length. The maximum commitment for a qcow2 file is surprisingly hard to get an accurate answer to. To calculate it requires intimate knowledge of the qcow2 file format and even the type of data stored in it. There is allocation overhead from the data structures used to map guest sectors to host file offsets, which is directly proportional to the capacity and the qcow2 cluster size (a cluster is the qcow2 equivalent “sector” concept, except much bigger – 65536 bytes by default). Over time qcow2 has grown other data structures though, such as various bitmap tables tracking cluster allocation and recent writes. With the addition of LUKS support, there will be key data tables. Most significantly though is that qcow2 can internally store entire VM snapshots containing the virtual device state, guest RAM and copy-on-write disk sectors. If snapshots are ignored, it is possible to calculate a value for the commitment, and it will be proportional to the capacity. If snapshots are used, however, all bets are off – the amount of storage that can be consumed is unbounded, so there is no commitment value that can be accurately calculated.

Summary

Considering the above information, for a newly created file the four size values would look like

Format Capacity Length Allocation Commitment
raw (sparse) 40GB 40GB 0 40GB [1]
raw (prealloc) 40GB 40GB 40GB [1] 40GB [1]
qcow2 (grow on demand) 40GB 193KB 196KB 41GB [2]
qcow2 (prealloc metadata) 40GB 41GB 6.5MB 41GB [2]
qcow2 (prealloc all) 40GB 41GB 41GB 41GB [2]
[1] XFS speculative preallocation may cause allocation/commitment to be very slightly higher than 40GB
[2] use of internal snapshots may massively increase allocation/commitment

For an application attempting to manage filesystem storage to ensure any future guest OS write will always succeed without triggering ENOSPC (out of space) in the host, the commitment value is critical to understand. If the length/allocation values are initially less than the commitment, they will grow towards it as the guest writes data. For raw files it is easy to determine commitment (XFS preallocation aside), but for qcow2 files it is unreasonably hard. Even ignoring internal snapshots, there is no API provided by libvirt that reports this value, nor is it exposed by QEMU or its tools. Determining the commitment for a qcow2 file requires the application to not only understand the qcow2 file format, but also directly query the header metadata to read internal parameters such as “cluster size” to be able to then calculate the required value. Without this, the best an application can do is to guess – e.g. add 2% to the capacity of the qcow2 file to determine likely commitment. Snapshots may life even harder, but to be fair, qcow2 internal snapshots are best avoided regardless in favour of external snapshots. The lack of information around file commitment is a clear gap that needs addressing in both libvirt and QEMU.

That all said, ensuring the sum of commitment values across disk images is within the filesystem free space is only one approach to managing storage. These days QEMU has the ability to live migrate virtual machines even when their disks are on host-local storage – it simply copies across the disk image contents too. So a valid approach is to mostly ignore future commitment implied by disk images, and instead just focus on the near term usage. For example, regularly monitor filesystem usage and if free space drops below some threshold, then migrate one or more VMs (and their disk images) off to another host to free up space for remaining VMs.

QEMU QCow2 built-in encryption: just say no. Deprecated now, to be deleted soon

Posted: March 17th, 2015 | Author: | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Security | Tags: , , | 5 Comments »

A little over 5 years ago now, I wrote about a how libvirt introduced support for QCow2 built-in encryption. The use cases for built-in qcow2 encryption were compelling back then, and remain so today. In particular while LUKS is fine if your disk backend is already a kernel visible block device, it is not a generically usable alternative for QEMU since it requires privileged operation to set it up, would require yet another I/O layer via a loopback or qemu-nbd device, and finally is entirely Linux specific. The direction QEMU has taken over the past few years has in fact been to take the kernel out of the equation for more & more functionality. For example, QEMU can now natively connect to RBD, Gluster, iSCSI and NFS servers with no kernel assistance – the client code is implemented entirely within QEMU block driver layer, which precludes the use of LUKS there.

At the time I wrote that blog post, no one had seriously looked at the QCow2 encryption design to see if it was in any way sane from a security POV. At least if they had, AFAIK, they didn’t make their analysis public. Over time though, various QEMU maintainers did eventually look at the QCow2 encryption code and their conclusions were not positive. The consensus opinion amongst QEMU maintainers today is that QCow2 encryption is terminally flawed in a number of serious ways, including but not limited to:

  • The AES-CBC cipher is used with predictable initialization vectors based on the sector number. This makes it vulnerable to chosen plaintext attacks which can reveal the existence of encrypted data.
  • The user passphrase is directly used as the encryption key.
    • A poorly chosen or short passphrase will compromise the security of the encryption.
    • In the event of the passphrase being compromised there is no way to change the passphrase to protect data in any qcow images.
    • It is difficult to make the data permanently inaccessible upon file deletion – at best you can try to destroy data with shred, though even this is ineffective with many modern storage technologies.

By comparison the LUKS encryption format does not suffer from any of these problems. With LUKS the initialization vectors typically use ESSIV to ensure unpredictability; the passphrase is only indirectly used to unlock the master encryption key material, so can be changed at will; the passphrase is put through a PBKDF2 function to mitigate the effects of short sequences of letters; the size of the master key material is artificially inflated with an anti-forensic algorithm to increase the difficulty of recovering the key from deleted volumes.

The QCow2 encryption scheme is a prime example of why merely using a well known standard algorithm (AES) is not sufficient to guarantee a secure implementation. In January 2014, I submitted an update for the QEMU docs to explicitly warn users about the security limitations of QCow2 encryption, which made it into the 1.5.0 release of QEMU. This week Markus has gone one step further and explicitly deprecated use of QCow2 encryption for the forthcoming 2.3.0 release of QEMU. Any attempt to use an encrypted QCow2 file with the QEMU system emulator will now result in a warning being printed to stderr, which in turn ends up in the libvirt logfile for that guest. As well as the security issues, Markus’ other motivation for deprecating this is that the way it is integrated into QEMU block driver layer causes a number of technical & usability problems. So even if we want encrypted block devices in QEMU, the internals for encryption need a complete rewrite from scratch.

In the 2.4.0 release, the intention is to go one step further and actually delete support for QCow2 encryption from the QEMU system emulator entirely, as well as all the infrastructure for block device encryption. We will keep support for decrypting images in the qemu-img program only, to provide a way for users to get their previously encrypted data out into a supported format.

In the immediate future, the recommendation is that users who need encryption for virtual disks should use LUKS on the host, despite the limitations that I noted earlier. At some point in the next 6 months my intention is to start working on a QEMU block driver implementation of the LUKS format, which will enable QEMU to add encryption to any of its virtual disk backends, not merely QCow2. This will require designing new infrastructure for handling decryption keys inside QEMU too, to replace the unsatisfactory approach used today. By using the LUKS format directly though, QEMU will benefit from the security knowledge of those who designed and analysed this format over many years to identify its strengths & weaknesses. It will also provide good interoperability. eg an encrypted qcow2-luks file will be able to be converted to/from a block device for access by the kernel’s LUKS driver with no need to re-encrypt the data, which is very desirable as it lets users decide whether to use in-QEMU or in-kernel block device backends at the flick of a switch.

So just to sum up. Do not ever use QCow2 built-in encryption as it exists today. It is unfixably broken by design. It is deprecated in QEMU 2.3.0 and is likely to be deleted in QEMU 2.4.0.