#25 BTRFS for guest VM
Opened 3 years ago by atim. Modified 3 years ago

How good or bad is BTRFS for guest VMs these days?

  • In which combination can better performance be achieved without losing the ability to take snapshots? For example, snapshots are not possible with the raw format, but they could be achieved with guest BTRFS snapshots instead, and since raw is faster than qcow2, maybe this is a good idea? (Rough sketch at the end of this post.)

  • Also, how does BTRFS (with compression) performance for a guest VM compare with EXT4 and XFS?

Information on the internet about this topic is mostly outdated and not relevant these days, since many things have improved: with more recent QEMU versions qcow2 files are much faster, BTRFS now has zstd compression, etc.
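
For reference, here is a rough sketch of the guest-BTRFS-snapshot side of that combo (paths and names below are just placeholders, not a tested setup):

```
# Inside the guest, instead of qcow2 snapshots: read-only BTRFS snapshots.
sudo mkdir -p /.snapshots
sudo btrfs subvolume snapshot -r / /.snapshots/root-$(date +%F)

# List and remove them later.
sudo btrfs subvolume list /
sudo btrfs subvolume delete /.snapshots/root-2020-11-01
```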


Some results from a clean Fedora 33 Workstation installation. SATA SSD on the host machine. The host FS is BTRFS and the nocow attribute is used on the guest VM image. Copying a 3.5 GB directory with various files inside the guest VM:

  • ext4: took 1m
  • btrfs: took 32s

BTRFS mount options (guest): noatime,compress=zstd:1,space_cache=v2. On BTRFS, cp --reflink=never was used in the test.
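
For anyone who wants to reproduce this, the setup looks roughly like the following (the device node and paths are examples only):

```
# Host: set nocow on the images directory so newly created VM images inherit +C.
chattr +C /var/lib/libvirt/images

# Guest: mount options used for the BTRFS test file system.
mount -o noatime,compress=zstd:1,space_cache=v2 /dev/vdb1 /mnt/test

# Guest: the copy test itself (--reflink=never so the data is actually rewritten).
time cp -r --reflink=never /mnt/test/data /mnt/test/data-copy
```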

Conclusion (test: copying files): BTRFS with compression in the guest virtual machine is twice as fast as EXT4. On an HDD the results could be even better.

I think virtualization folks use qcow2 by default in order to leverage additional features like full VM snapshots that include the entire machine state, memory included, not just one file system or the disk. I don't think we'll see the default change to raw. But I think experiments that show suboptimal behaviors are useful for performance enhancement.

HDDs have huge latencies, so they will be very sensitive to anything that improves or worsens that latency. If you reduce reads/writes with compression, it'll speed things up quite a bit. But an HDD is also more sensitive to fragmentation. I suspect SSDs are generally OK with sparse files for VM backing, despite some seemingly significant fragmentation, whereas this could really clobber performance on a hard drive, where you'd want to fallocate the backing file (fallocate is the default qcow2 behavior for virt-manager, but sparse is the default for GNOME Boxes).
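
If you want to see how fragmented a backing file actually is, filefrag gives a rough indicator (the path is just an example):

```
# Extent count of the backing file: more extents generally means more seeking on HDD.
filefrag /var/lib/libvirt/images/f33.qcow2

# Verbose extent map, if you want to see where the pieces land.
filefrag -v /var/lib/libvirt/images/f33.qcow2
```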

Another variable is the guest file system write behavior. Windows/NTFS is really aggressive, resulting in many fragments in a short amount of time. The worst of this is mitigated by nodatacow, but with sparse files it's still much worse than with other guest file systems.

And yet another variable is the caching scheme. I think most everyone these days defaults to writeback caching. This is OK. I personally prefer taking the risk with the unsafe cache option, which is safe if the guest crashes, but quite unsafe for the guest if the host crashes while the guest is writing. It surely can't be made the default, no matter the significant performance gain.
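
As a concrete illustration, the cache mode ends up on the QEMU command line roughly like this (the disk path and memory size are made up; with libvirt it is the cache= attribute on the disk's driver element):

```
# cache=unsafe tells QEMU to ignore guest flush requests: fast, and fine if only
# the guest crashes, but risky for the guest image if the host goes down mid-write.
# cache=writeback is the usual default.
qemu-system-x86_64 -enable-kvm -m 4096 \
  -drive file=/var/lib/libvirt/images/f33.qcow2,if=virtio,format=qcow2,cache=unsafe
```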

I found one case that you probably want to avoid if you are using BTRFS in a guest VM that is stored on a BTRFS host HDD; check the video demo (guest f33 boot process).

In this scenario we get abnormally terrible performance and constant HDD thrashing. Guest f33 boot time is about 6 minutes. And I guess things will get worse in the future as the VM becomes more fragmented. The nocow attribute is used on the guest VM image.

Mount options: compress=zstd:1,space_cache=v2, with compression used on both the guest and host FS.

@atim How do we avoid it? There's no info on the configuration or how to reproduce the problem. Feel free to open an RHBZ against component kernel, assignee fedora-kernel-btrfs@fedoraproject.org.

Include info about the host configuration (CPU, memory, swap, mount options for / and for the libvirt storage pool file systems). For the guest, the easiest is to virsh dumpxml <domain> to a file and attach the file. Also include the guest file system mount options. In particular, note any deviation from defaults.
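
Something along these lines would capture most of that (the domain name is just an example):

```
# Host side: CPU, memory, swap, and mount options for / and the storage pool.
lscpu > host-info.txt
free -h >> host-info.txt
swapon --show >> host-info.txt
findmnt -no TARGET,FSTYPE,OPTIONS / >> host-info.txt
findmnt -no TARGET,FSTYPE,OPTIONS /var/lib/libvirt/images >> host-info.txt

# Guest definition, attached as a file.
virsh dumpxml f33-guest > f33-guest.xml

# Inside the guest: file system mount options.
findmnt -t btrfs -o TARGET,FSTYPE,OPTIONS
```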

How do we avoid it?

I thought this was natural in such a case; I just wanted to document it, or maybe put it in the btrfs gotchas. To avoid it, simply don't use btrfs in the guest VM if it is stored on an HDD.

BTW, this is the same VM that was previously on an SSD with a btrfs host, and it worked absolutely fine. I didn't notice any performance issues on the SSD, only after moving it to the HDD.

Include info about the host configuration <snip> In particular note any deviation from defaults.

This VM and host partition are already nuked, sorry. If I had known this earlier... But I've added the deviations from defaults, and the mount options in particular, to the previous post.

I get ~12s boot on SSD, and ~22s boot on HDD. Fedora 33 as host and guest, default/automatic partitioning for both. So there's some other factor going on that results in 6 minute boot times.

I'll try to experiment and repeat this scenario in the near future. Also, I now remember that I didn't notice this right after transferring my VM from SSD to HDD. It probably takes some time until it becomes heavily fragmented. Or maybe there's some other factor, yep. But the performance became really horrible, literally unusable.

Possibly the biggest single optimization to avoid fragmentation of the image file is to preallocate it, i.e. on the CLI use fallocate for raw, and for qemu-img use the option preallocation=falloc. It's fast, doesn't write zeros, and is the GNOME Boxes default. It's not the default for virt-manager, which uses preallocation=metadata unless you check the "Allocate entire volume now" option when creating the raw or qcow2.
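
For example (size and file names are placeholders):

```
# Raw: preallocate the whole file without writing zeros.
fallocate -l 40G /var/lib/libvirt/images/f33.raw

# qcow2: same idea via qemu-img.
qemu-img create -f qcow2 -o preallocation=falloc /var/lib/libvirt/images/f33.qcow2 40G
```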

Compression in the guest does help performance overall, by reducing IO and therefore the exposure to rotational and seek latencies. It's possible to quantify the latencies with bcc-tools. See fileslower and biosnoop - you'll be able to see the effect of seek latency in particular. And those latencies are definitely exacerbated by sparse images on HDD.
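
On Fedora those come from the bcc-tools package; invocation is roughly as follows (the 10 ms threshold and tool paths are examples and may differ by distro):

```
sudo dnf install bcc-tools

# File-level: show reads/writes that take longer than 10 ms.
sudo /usr/share/bcc/tools/fileslower 10

# Block-level: per-IO latency, so seek-heavy access patterns stand out.
sudo /usr/share/bcc/tools/biosnoop
```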

The second biggest factor, which becomes the biggest once you preallocate, is the cache mode. For sure unsafe is better. It's used in Fedora infrastructure because it's that much faster, but they also take precautions so that the host isn't crashing. My experience is that the guest can crash or be force quit with impunity and neither the guest nor the host Btrfs cares. However, if the host abruptly dies, it can make the guest Btrfs non-repairable depending on what's damaged. For obvious reasons we can't use something called unsafe by default.
