On my Intel i7 laptop, xxhash is a small but clear performance win over crc32c:
$ ./hash-speedtest 10000000
Block size: 4096
Iterations: 10000000
Implementation: builtin
NULL-NOP:    cycles: 1372543560,   c/i 137
NULL-MEMCPY: cycles: 2844174884,   c/i 284
CRC32C:      cycles: 9673117404,   c/i 967
XXHASH:      cycles: 7129819594,   c/i 712
SHA256:      cycles: 649914613520, c/i 64991
BLAKE2b:     cycles: 153513008046, c/i 15351
And I'm given to understand that this is even more the case on newer CPUs.
Plus, it's 64-bit instead of 32-bit. The 256-bit algorithms are obviously much, much slower and probably not right for a default, but should we consider making xxhash the default for Fedora Linux systems with btrfs?
@josef, @dcavalca, @salimma: Any thoughts here? It seems reasonable for us to change the default in F34+ with the compression change.
Metadata Update from @ngompa: - Issue tagged with: Cloud, Desktop, Kernel, Server, Utils
Metadata Update from @ngompa: - Issue untagged with: Cloud, Desktop, Server
FWIW, on an AMD Ryzen 7, similar results:
$ ./hash-speedtest 10000000
Block size: 4096
Iterations: 10000000
Implementation: builtin
NULL-NOP:    cycles: 3277089704,   c/i 327
NULL-MEMCPY: cycles: 5527082126,   c/i 552
CRC32C:      cycles: 20922943927,  c/i 2092
XXHASH:      cycles: 13123907067,  c/i 1312
SHA256:      cycles: 876548520868, c/i 87654
BLAKE2b:     cycles: 200406665246, c/i 20040
How could I test this? I have an Atom Avoton server with btrfs storage; it would be great to know the performance there.
The hash-speedtest program is part of btrfs-progs but is not built by default for some reason. I downloaded the source and ran ./configure; make hash-speedtest to build it.
Unfortunately, to test the actual effect in use rather than a benchmark, you need to create an all-new filesystem... it can't be changed after the fact.
I don't have a strong feeling either way about this; my only question was about support in grub2, but it appears to just ignore csums completely, so hooray? The hash-speedtest is a solid tool for testing on a variety of hardware; if you guys feel comfortable changing it then by all means go for it.
Oh, also, I noticed somebody making a comment about size considerations: we actually reserve 32 bytes for the csum, we just only use 4 of them for crc32c or 8 for xxhash. The actual size usage on disk remains the same no matter which hash you choose for your csums.
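For illustration, this is visible in the on-disk format; a simplified sketch of the metadata header (field layout abridged from the kernel's btrfs definitions):

/* Simplified excerpt of the btrfs on-disk metadata header: the full
 * 32-byte csum area is always reserved, whatever algorithm is in use. */
#include <stdint.h>

#define BTRFS_CSUM_SIZE 32
#define BTRFS_FSID_SIZE 16

struct btrfs_header {
    uint8_t csum[BTRFS_CSUM_SIZE]; /* crc32c fills 4 bytes, xxhash64 fills 8,
                                      sha256/blake2b use all 32 */
    uint8_t fsid[BTRFS_FSID_SIZE];
    /* ... bytenr, flags, generation, owner, nritems, level ... */
} __attribute__((packed));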
Huh, so, theoretically, it would be possible to change hash algorithms after the fact by recalculating the entire filesystem?
Sure, theoretically you could re-hash everything offline pretty easily.
How much consideration should be given to the fact that such file systems will not mount with kernel 5.4 and older (i.e. RHEL/CentOS, Debian, Ubuntu)? Arch, Gentoo, and Tumbleweed will be able to mount them.
I'm inclined to think this should follow the change process, and work out the various issues we may not be considering.
Yeah, I don't think this should be slipped in as part of the existing change, and it's too late for F34 for something that isn't at the level of getting an exception.
I tend to agree with @chrismurphy here. While this does look fairly straightforward, it's still a change with visible implications that should be communicated to users.
Metadata Update from @ngompa: - Issue set to the milestone: Future Release
Sorry, @chrismurphy made me realize I'm a giant idiot. We reserve 32 bytes in the metadata for the hash, so we don't take any more space for checksumming metadata, but it will of course double the size used for data checksums, which would need to be taken into account. Sorry about that, not sure where my head was.
FWIW, AMD Ryzen Threadripper PRO 3945WX (24 threads) @ 4.000GHz:
$ ./hash-speedtest 10000000
Block size: 4096
Iterations: 10000000
Implementation: builtin
NULL-NOP:    cycles: 2065484440,   c/i 206
NULL-MEMCPY: cycles: 4363066160,   c/i 436
CRC32C:      cycles: 16103553000,  c/i 1610
XXHASH:      cycles: 11981331480,  c/i 1198
SHA256:      cycles: 899613816040, c/i 89961
BLAKE2b:     cycles: 195169520400, c/i 19516
Figure 1MiB of csums per 1GiB of data for crc32c and 2MiB for xxhash, which means ~128 leaves for crc32c and ~256 leaves for xxhash, which translates to maybe one extra node. To make it easier to visualize, see the sketch below.
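As a back-of-the-envelope check (a hypothetical sketch assuming the default 4 KiB checksummed blocks and 16 KiB metadata leaves; the naive division is a lower bound, since real btree leaves aren't packed completely full, hence the higher ~128/~256 figures above):

#include <stdio.h>

int main(void) {
    const unsigned long long data  = 1ULL << 30; /* 1 GiB of data */
    const unsigned long long block = 4096;       /* checksummed unit */
    const unsigned long long leaf  = 16384;      /* metadata leaf size */

    unsigned long long blocks = data / block;    /* 262144 blocks */
    /* crc32c stores 4 bytes per block, xxhash64 stores 8 */
    printf("crc32c: %llu MiB of csums (>= %llu leaves)\n",
           (blocks * 4) >> 20, blocks * 4 / leaf);
    printf("xxhash: %llu MiB of csums (>= %llu leaves)\n",
           (blocks * 8) >> 20, blocks * 8 / leaf);
    return 0;
}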
What does this mean as far as speed, though? Probably not too much. Generally speaking, if you dd a giant file you're going to be rate-limited by the amount of memory on the system, so you may not end up generating 2x the amount of metadata writes, maybe only 1.5x. But likely more than baseline, so it becomes a question of trading CPU time for disk time. For us (Facebook) we'll happily trade cycles on the CPU for disk, because the CPU is far faster and far more predictable than the disk. This is why we default to -o compress-force=zstd:3; the space savings are nice of course, but really the latency/performance boost from not writing as much is the much larger upside for us.
Again, I have no strong feelings here, just providing more data and insight. Whatever you guys want to do I'm cool with; I can see benefits with both options.
I'd be curious to know what the impact would be here for ARM. In general, ARM systems tend to be measurably weaker, would this have a significant positive/negative impact there?
I'm not sure. Unfortunately btrfs-progs' hash-speedtest does not compile on ARM.
Yeah, it's got some stuff in assembly to read CPU cycle counts.
I've reached out to Arm to ask if someone can look at it, generalise it, and submit a patch upstream.
Once that happens I can run some tests across my range of edge HW which includes lower end x86 Atom as well as various generations of Arm HW. I'll post an update here when I have been able to do so.
@pbrobinson That would be awesome! Thanks for that!
So two Arm engineers kindly took a quick look at this for me and provided some feedback. Notes below, including my own pieces of digging and testing.
First thing to note is that the hash-speedtest program uses the locally included userspace implementations of the hashes; in the case of x86, the crc32c code uses the SSE4.2 crc32 instruction[1]. But this code is not the kernel implementation of the hashes, which is what the actual btrfs filesystem uses, and the kernel implementations have likely had a lot more focus on tuning/optimisation.
Second, Linux does not have any architecture-optimised xxhash code; it is all pure C, so all architectures should be about equal there. For crc32c there are various arch-optimised variants: x86 has an instruction in SSE4.2+, and the Arm implementations can make use of HW crc32c instructions, which the RPi3 has but some older devices don't (probably not too much of an issue); no clue about POWER/Z-series.
The benchmark could be rewritten to use the AF_ALG interface, which makes use of the kernel variants, but the data sizes would need to be >4k, otherwise the syscall overhead becomes a significant percentage of the total, throws the benchmark off, and makes it unrepresentative of real-world storage.
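For reference, driving a kernel hash from userspace via AF_ALG is fairly compact; a minimal sketch (hashing a single 4 KiB buffer with the kernel's crc32c, error handling omitted) might look like:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

int main(void) {
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "hash",
        .salg_name   = "crc32c",      /* must be listed in /proc/crypto */
    };
    unsigned char buf[4096] = { 0 }, digest[4];

    int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
    int op = accept(tfm, NULL, 0);

    write(op, buf, sizeof(buf));      /* feed one 4 KiB block */
    read(op, digest, sizeof(digest)); /* read back the 4-byte digest */

    printf("crc32c: %02x%02x%02x%02x\n",
           digest[0], digest[1], digest[2], digest[3]);
    close(op);
    close(tfm);
    return 0;
}

As noted above, a real benchmark would want to push far more than 4 KiB per syscall so the socket overhead doesn't dominate.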
Digging into tests myself, I couldn't see how to exercise basic AF_ALG hash support for crc32 or xxhash through either openssl or libkcapi-tools, which are means to access the kernel crypto framework, even though I can see them in /proc/crypto and it seems it's possible[2]. I'm still digging into tools that can give us some actual real-world data on kernel hash algorithm throughput and that can be run across a range of devices to determine the best outcome here.
Overall I don't think the hash-speedtest tool is actually indicative of the performance you'll get on the actual filesystem, which is probably why it's not built by default, and I personally don't believe we should be using it to make decisions about the btrfs hash.
[1] https://github.com/kdave/btrfs-progs/blob/master/crypto/crc32c.c#L30
[2] https://gist.github.com/robn/7e1f27963710c17eda0f
In typical fashion, right after I posted this... I think we may be able to get more general tests with the fio tool, as this is what it's designed for. I've not spent any time actually running tests with it yet; it has a vast array of options for generating load/threads etc.
A no-frills test on my laptop:
$ fio --crctest=
fio: unknown hash ``. Available:
md5 crc64 crc32 crc32c crc16 crc7 sha1 sha256 sha512 xxhash murmur3 jhash fnv sha3-224 sha3-256 sha3-384 sha3-512
[root@laptop]# fio --crctest=crc32c
crc32c: 9887.61 MiB/sec
[root@laptop]# fio --crctest=xxhash
xxhash: 7396.92 MiB/sec
So, some basic testing across a number of devices with just "fio --crctest=crc32c; fio --crctest=xxhash". If someone wants to put together a more indicative storage-throughput test I can retest, but this should be purely CPU/memory throughput and independent of the actual storage; a lot of the devices below have storage slow enough (microSD/eMMC) that I doubt either hash's performance would have any real-world impact.
I think the summary here is that Intel/aarch64 systems actually tend to be faster with crc32c than xxhash; on 32-bit Arm platforms without HW crc32c it's slower, but we also tend not to focus on desktop environments there anymore, so btrfs is less common.
Other notes: it's no shock that the Mustang is slow; it predates the crc32c HW instructions on aarch64 (it was the very first aarch64 device). And yes, it amuses me that fio crashes for crc32c on the RPi devices: https://bugzilla.redhat.com/show_bug.cgi?id=1962050
x86_64 devices:

UpSquared (Atom Apollo Lake):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 6175.82 MiB/sec
xxhash: 3090.70 MiB/sec

Compulabs Fitlet2 (Atom Apollo Lake):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 4922.89 MiB/sec
xxhash: 2475.89 MiB/sec

Lenovo Carbon X1 gen 6 (i7-8650U):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 8124.66 MiB/sec
xxhash: 6982.90 MiB/sec

aarch64 devices:

NVIDIA Jetson Xavier AGX:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 5937.06 MiB/sec
xxhash: 3161.35 MiB/sec

NVIDIA Jetson Nano:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 7506.67 MiB/sec
xxhash: 3436.10 MiB/sec

Solid Run Honeycomb - NXP Layerscape 2160A:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 10765.35 MiB/sec
xxhash: 4069.82 MiB/sec

Raspberry Pi4:
$ fio --crctest=crc32c; fio --crctest=xxhash
Illegal instruction (core dumped)
xxhash: 2774.11 MiB/sec

Raspberry Pi3B+:
$ fio --crctest=crc32c; fio --crctest=xxhash
Illegal instruction (core dumped)
xxhash: 1612.94 MiB/sec

Solid Run MacBin:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 8270.07 MiB/sec
xxhash: 2984.97 MiB/sec

NVIDIA Jetson TX2:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 7266.12 MiB/sec
xxhash: 3473.92 MiB/sec

Rock960 (Rockchip rk3399):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 8777.04 MiB/sec
xxhash: 3335.90 MiB/sec

Old aarch64/armv7 platforms without crc32c HW instructions:

Mustang (x-gene 1):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 251.90 MiB/sec
xxhash: 1838.27 MiB/sec

Panda-ES (omap4):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 79.68 MiB/sec
xxhash: 365.88 MiB/sec

BeagleBlack (am33xx):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 28.20 MiB/sec
xxhash: 90.72 MiB/sec

Wandboard (i.MX6):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 85.02 MiB/sec
xxhash: 393.08 MiB/sec
Wow, this is some fantastic insight. It seems like the hit for switching to xxhash is pretty significant across the board, even on x86_64, because the kernel lacks hardware-optimized variants of xxhash.
With this information in mind, do we want to continue considering xxhash?
And crc32c also wins on my AMD Threadripper system:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 10890.37 MiB/sec
xxhash: 8065.02 MiB/sec
although both of these numbers are at the "wow that's fast" level so I don't mind having picked xxhash for its length for my own system.
I'd be curious if sha256 would be fast, given that it usually has hardware acceleration, right? Or am I misremembering?
The answer is, basically: considerably slower.
$ fio --crctest=sha256
Carbon X1: sha256: 60.89 MiB/sec
UpSquared: sha256: 99.63 MiB/sec
Jetson AGX: sha256: 170.08 MiB/sec
Honeycomb: sha256: 137.73 MiB/sec
Rock960: sha256: 112.98 MiB/sec
Amused that the cheap Arm device is almost double the speed of the Thinkpad, and interesting that the Atom is 1.5 times faster. I wonder if that's because of LUKS/VPN and other crypto running on my laptop.
I would guess blake2b would see a similar performance dip, as it was a SHA-3 contestant...?
You could run it locally yourself ;-)
There's definitely HW accelerated SHA-3 but I'm not sure how widely deployed it is. I can add it to the todo list.
fio doesn't seem to have a blake2b option. But:
$ fio --crctest=crc32c; fio --crctest=xxhash; fio --crctest=sha3-224
crc32c: 10964.07 MiB/sec
xxhash: 8108.71 MiB/sec
sha3-224: 81.50 MiB/sec
Only 100x slower than xxhash on my AMD system. :)
@josef Do you know if there's any efforts going on for hardware-accelerated xxhash implementations in the kernel?
It's possible to compile hash-speedtest using --with-crypto=libkcapi, which uses the kernel crypto API instead of the built-in implementations (the default).
Last I checked, it makes a pretty big difference with sha256. But also, we're not getting hardware-accelerated sha256 with Fedora kernels, because the generic module is what gets loaded by default. I think it's because btrfs is built-in while CONFIG_CRYPTO_SHA256_SSSE3=m, and the API isn't smart enough to start with generic and then switch to SSSE3 once it becomes available.
BLAKE was a SHA-3 finalist and ~3x faster than Keccak, which was finally selected. BLAKE2 is even faster, beating SHA-256 by quite a lot. But SHA-3 isn't supported in Btrfs, nor is XXH3, the successor to xxhash64. Note that using a checksum other than crc32c means a btrfs file system won't be mountable with a kernel older than 5.5. This isn't a big concern in Fedora land, but it might be a concern for downstreams.
xxhash64 was picked mainly to mitigate the collision potential of crc32c while not being meaningfully more computationally expensive. So maybe Fedora could lead on determining a threshold for when to use xxhash by default? Collision potential increases as file system size increases, so maybe mkfs should flip to xxhash at a certain block device size? Since Btrfs trivially grows just by adding devices, the mkfs threshold should probably flip the default to xxhash well before there's some meaningful chance of collisions. Maybe 2TiB?
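To put rough numbers on that intuition, here's a hypothetical birthday-bound sketch (it assumes uniformly distributed digests over 4 KiB blocks, which is only a model; how much collisions actually matter depends on how the csums get consumed, e.g. by dedup tooling):

#include <stdio.h>
#include <math.h>

int main(void) {
    const double sizes_tib[] = { 1, 2, 16, 64 };
    for (int i = 0; i < 4; i++) {
        double blocks = sizes_tib[i] * pow(2, 40) / 4096; /* 4 KiB blocks */
        double pairs  = blocks * (blocks - 1) / 2;
        /* E[collisions] = pairs / 2^bits under a uniform-hash model */
        printf("%5.0f TiB: ~%.2e expected 32-bit collisions, ~%.2e 64-bit\n",
               sizes_tib[i], pairs / pow(2, 32), pairs / pow(2, 64));
    }
    return 0;
}

Under this model a 32-bit digest already expects a collision somewhere below 1 GiB of unique blocks, while a 64-bit digest doesn't cross one expected collision until tens of TiB.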
xxhash is already optimized for CPUs, exploiting instruction-level parallelism, so there's not much room to make it faster; and since it's not standardized and widely used the way crc32c is, hardware acceleration is unlikely.
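That instruction-level parallelism is visible in the shape of the XXH64 hot loop; simplified from the reference implementation, four independent accumulators stay in flight so a superscalar CPU can overlap the rounds:

#include <stdint.h>

#define PRIME64_1 11400714785074694791ULL
#define PRIME64_2 14029467366897019727ULL

static uint64_t xxh64_round(uint64_t acc, uint64_t input) {
    acc += input * PRIME64_2;
    acc = (acc << 31) | (acc >> 33);   /* rotate left by 31 */
    return acc * PRIME64_1;
}

/* One 32-byte stripe: the four lanes have no data dependency on each
 * other, so the multiplies pipeline. */
static void xxh64_stripe(uint64_t acc[4], const uint64_t lane[4]) {
    for (int i = 0; i < 4; i++)
        acc[i] = xxh64_round(acc[i], lane[i]);
}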
Recent benchmarks: https://github.com/kdave/btrfs-progs/issues/548#issuecomment-2058890102
And a recent update on adding more algorithms, e.g. XXH3: https://github.com/kdave/btrfs-progs/issues/548#issuecomment-2064050115
I think it's unlikely Fedora will deviate from upstream, so I'm not sure we need to keep this issue open.
Note, there is experimental support for checksum conversion mentioned later in the thread.