#40 Consider defaulting to xxhash
Opened 4 years ago by mattdm. Modified 4 months ago

On my Intel i7 laptop, xxhash is a small but clear performance win over crc32c:

$ ./hash-speedtest  10000000
Block size:     4096
Iterations:     10000000
Implementation: builtin

    NULL-NOP: cycles:   1372543560, c/i      137
 NULL-MEMCPY: cycles:   2844174884, c/i      284
      CRC32C: cycles:   9673117404, c/i      967
      XXHASH: cycles:   7129819594, c/i      712
      SHA256: cycles: 649914613520, c/i    64991
     BLAKE2b: cycles: 153513008046, c/i    15351

And I'm given to understand that this is even more the case on newer CPUs.

Plus, it's 64-bit instead of 32-bit. The 256-bit algorithms are obviously much, much slower and probably not right for a default, but should we consider making xxhash the default for Fedora Linux systems with btrfs?


@josef, @dcavalca, @salimma: Any thoughts here? It seems reasonable for us to change the default in F34+ with the compression change.

Metadata Update from @ngompa:
- Issue tagged with: Cloud, Desktop, Kernel, Server, Utils

4 years ago

Metadata Update from @ngompa:
- Issue untagged with: Cloud, Desktop, Server

4 years ago

FWIW, on an AMD Ryzen 7, similar results:

$ ./hash-speedtest 10000000
Block size:     4096
Iterations:     10000000
Implementation: builtin

    NULL-NOP: cycles:   3277089704, c/i      327
 NULL-MEMCPY: cycles:   5527082126, c/i      552
      CRC32C: cycles:  20922943927, c/i     2092
      XXHASH: cycles:  13123907067, c/i     1312
      SHA256: cycles: 876548520868, c/i    87654
     BLAKE2b: cycles: 200406665246, c/i    20040

How could I test this? I have an Atom Avoton server with btrfs storage; it would be great to know the performance there.

The hash-speedtest program is part of btrfs-progs but not built by default for some reason. I downloaded the source and ran ./configure; make hash-speedtest to build it.
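
Roughly, the full recipe is something like this (the dependency list is approximate; adjust package names for your distro):

$ sudo dnf install autoconf automake gcc make libuuid-devel zlib-devel libzstd-devel lzo-devel e2fsprogs-devel
$ git clone https://github.com/kdave/btrfs-progs.git
$ cd btrfs-progs
$ ./autogen.sh && ./configure --disable-documentation
$ make hash-speedtest
$ ./hash-speedtest 10000000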

Unfortunately, to test the actual effect in use rather than just benchmark it, you need to create an all-new filesystem... the checksum can't be changed after the fact.
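
Creating a fresh filesystem with a non-default checksum is just a mkfs flag; something like this (device name is a placeholder):

$ mkfs.btrfs --csum xxhash /dev/sdX
$ btrfs inspect-internal dump-super /dev/sdX | grep csum_type   # verify what was picked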

I don't have a strong feeling either way about this. My only question was about support in grub2, but it appears to just ignore csums completely, so hooray? hash-speedtest is a solid tool for testing on a variety of hardware; if you guys feel comfortable changing the default then by all means go for it.

Oh, also: I noticed somebody making a comment about size considerations. We actually reserve 32 bytes for the csum; we just only use 4 for crc32c or 8 for xxhash. The actual size usage on disk remains the same no matter which hash you choose for your csums.

Huh, so, theoretically, it would be possible to change hash algorithms after the fact by recalculating the entire filesystem?

Sure, theoretically you could re-hash everything offline pretty easily.

How much consideration should be given to the fact that such file systems will not mount with kernel 5.4 and older? i.e., RHEL/CentOS, Debian, and Ubuntu would be affected; Arch, Gentoo, and Tumbleweed will be able to mount them.

I'm inclined to think this should follow the change process, and work out the various issues we may not be considering.

Yeah, I don't think this should be slipped in as part of the existing change, and it's too late for F34 for something that isn't at the level of getting an exception.

I tend to agree with @chrismurphy here. While this does look fairly straightforward, it's still a change with visible implications that should be communicated to users.

Metadata Update from @ngompa:
- Issue set to the milestone: Future Release

4 years ago

Sorry, @chrismurphy made me realize I'm a giant idiot. We reserve 32 bytes in the metadata for the hash, so we don't take any more space for checksumming metadata, but it will of course double the size used for data checksums, which would need to be taken into account. Sorry about that, not sure where my head was.

FWIW AMD Ryzen Threadripper PRO 3945WX s (24) @ 4.000GHz:

$ ./hash-speedtest 10000000
Block size:     4096
Iterations:     10000000
Implementation: builtin

    NULL-NOP: cycles:   2065484440, c/i      206
 NULL-MEMCPY: cycles:   4363066160, c/i      436
      CRC32C: cycles:  16103553000, c/i     1610
      XXHASH: cycles:  11981331480, c/i     1198
      SHA256: cycles: 899613816040, c/i    89961
     BLAKE2b: cycles: 195169520400, c/i    19516

Figure 1 MiB of csums per 1 GiB of data for crc32c, 2 MiB for xxhash, which means ~128 leaves for crc32c, ~256 leaves for xxhash, which translates to maybe one extra node? To make it easier to visualize:

data size   crc32c    xxhash
1 GiB       1 MiB     2 MiB
100 GiB     100 MiB   200 MiB
1 TiB       1 GiB     2 GiB
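
As a sanity check, those figures are just block count times csum size (4 bytes for crc32c, 8 for xxhash), e.g. in shell:

$ blocks=$(( (1 << 30) / 4096 ))    # 4 KiB data blocks per 1 GiB
$ echo "$(( blocks * 4 / 1048576 )) MiB crc32c, $(( blocks * 8 / 1048576 )) MiB xxhash"
1 MiB crc32c, 2 MiB xxhash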

What does this mean as far as speed tho? Probably not too much. Generally speaking, if you dd a giant file you're going to be rate-limited by the amount of memory on the system, so you may not end up generating 2x the amount of metadata writes, maybe only 1.5x. But likely more than baseline, so it becomes a question of trading CPU time for disk time. For us (Facebook), we'll happily trade cycles on the CPU for disk, because the CPU is far faster and far more predictable than the disk. This is why we default to -o compress-force=zstd:3: the space savings are nice of course, but really the latency/performance boost from not writing as much is the much larger upside for us.
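
For reference, that tradeoff as it appears on the command line (device and mountpoint are placeholders):

$ mount -o compress-force=zstd:3 /dev/sdX /mnt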

Again, I have no strong feelings here, just providing more data and insight. Whatever you guys want to do I'm cool with; I can see benefits with both options.

I'd be curious to know what the impact would be here for ARM. In general, ARM systems tend to be measurably weaker; would this have a significant positive or negative impact there?

I'm not sure. Unfortunately btrfs-progs' hash-speedtest does not compile on ARM.

Yeah, it's got some stuff in assembly to read CPU cycle counts.

I've reached out to Arm to ask if someone can look at it, generalise it, and submit a patch upstream.

Once that happens I can run some tests across my range of edge HW, which includes lower-end x86 Atom as well as various generations of Arm HW. I'll post an update here when I've been able to do so.

@pbrobinson That would be awesome! Thanks for that!

So, two Arm engineers kindly took a quick look at this for me and provided some feedback. Notes below, including my own digging and testing.

First thing to note is that the hash-speedtest program uses the locally included userspace versions of the hashes; in the case of x86 it uses the SSE4.2 crc32 instruction [1]. But this code is not the kernel implementation of the hashes, which is what the actual btrfs filesystem uses, and the kernel implementations likely had a lot more focus on tuning/optimisation.

Second thing to note is that Linux does not have any architecture-optimised xxhash code; it is all pure C, so all architectures should be about equal. For crc32c there are various arch-optimised variants: x86 has an instruction in SSE4.2+, and the arm implementations can make use of HW crc32c instructions, which the RPi3 has but some older devices don't, though that's probably not too much of an issue. No clue on POWER/Z-series.
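
You can see which implementation (and priority) a running kernel has registered via /proc/crypto; on an SSE4.2-capable x86 machine you'd hope to see something like:

$ grep -A3 'name.*: crc32c' /proc/crypto
name         : crc32c
driver       : crc32c-intel
module       : crc32c_intel
priority     : 200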

The benchmark could be rewritten to use the AF_ALG interface, which exercises the kernel variants, but the data sizes would need to be >4k; otherwise the syscall overhead becomes a significant percentage of the total and throws the benchmark off, and it wouldn't be a test indicative of real-world storage.

Digging into tests myself, I couldn't see how to test basic AF_ALG hash support for crc32 or xxhash through either openssl or libkcapi-tools, which are means to access the kernel crypto framework, even though I can see them in /proc/crypto and it seems it's possible [2]. I'm still digging into tools that can give us some actual real-world data on kernel hash algorithm throughput that can be run across a range of devices to determine the best outcome here.

Overall I don't think the hash-speedtest tool is indicative of the performance you'll necessarily get on the actual filesystem, which is probably why it's not built by default, and I personally don't believe we should be using it to make decisions on the btrfs hash.

[1] https://github.com/kdave/btrfs-progs/blob/master/crypto/crc32c.c#L30
[2] https://gist.github.com/robn/7e1f27963710c17eda0f

In typical fashion, right after I post this... I think we may be able to get more general tests with the fio tool, as that's what it's designed for. I've not spent any time actually running tests with it yet. It has a vast array of options for generating load/threads etc.

A no-frills test on my laptop:

$ fio --crctest=
fio: unknown hash ``. Available:
md5
crc64
crc32
crc32c
crc16
crc7
sha1
sha256
sha512
xxhash
murmur3
jhash
fnv
sha3-224
sha3-256
sha3-384
sha3-512
[root@laptop]# fio --crctest=crc32c
crc32c: 9887.61 MiB/sec
[root@laptop]# fio --crctest=xxhash
xxhash: 7396.92 MiB/sec

So, some basic testing across a number of devices with just "fio --crctest=crc32c; fio --crctest=xxhash". If someone wants to put together a more indicative storage throughput test I can retest, but this should be purely CPU/memory throughput, independent of the actual storage; a lot of the devices below have storage slow enough (microSD/eMMC) that I doubt either hash's performance would have any real-world impact.
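
For anyone repeating this on their own hardware, the whole set is just a shell loop:

$ for h in crc32c xxhash; do fio --crctest=$h; done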

I think the summary here is that Intel/aarch64 systems tend to actually be faster with crc32c than xxhash. On 32-bit arm platforms without HW crc32c it's slower, but we also tend not to focus on desktop environments there anymore, so btrfs is less common.

Other notes:
It's no shock that the Mustang is slow; it predates the crc32c HW instructions on aarch64 (it was the very first aarch64 device).
Yes, it amuses me that fio crashes for crc32c on the RPi devices:
https://bugzilla.redhat.com/show_bug.cgi?id=1962050

x86_64 devices:
UpSquared (Atom Apollo Lake):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 6175.82 MiB/sec
xxhash: 3090.70 MiB/sec

Compulabs Fitlet2 (Atom Apollo Lake):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 4922.89 MiB/sec
xxhash: 2475.89 MiB/sec

Lenovo Carbon X1 gen 6 (i7-8650U):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 8124.66 MiB/sec
xxhash: 6982.90 MiB/sec

aarch64 devices:
NVIDIA Jetson Xavier AGX:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 5937.06 MiB/sec
xxhash: 3161.35 MiB/sec

NVIDIA Jetson Nano:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 7506.67 MiB/sec
xxhash: 3436.10 MiB/sec

Solid Run Honeycomb - NXP Layerscape 2160A:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 10765.35 MiB/sec
xxhash: 4069.82 MiB/sec

Raspberry Pi4:
$ fio --crctest=crc32c; fio --crctest=xxhash
Illegal instruction (core dumped)
xxhash: 2774.11 MiB/sec

Raspberry Pi3B+:
$ fio --crctest=crc32c; fio --crctest=xxhash
Illegal instruction (core dumped)
xxhash: 1612.94 MiB/sec

Solid Run MacBin:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 8270.07 MiB/sec
xxhash: 2984.97 MiB/sec

NVIDIA Jetson TX2:
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 7266.12 MiB/sec
xxhash: 3473.92 MiB/sec

Rock960 (Rockchip rk3399):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 8777.04 MiB/sec
xxhash: 3335.90 MiB/sec

Old aarch64/armv7 platforms without crc32c HW instructions:
Mustang (x-gene 1):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 251.90 MiB/sec
xxhash: 1838.27 MiB/sec

Panda-ES (omap4)
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 79.68 MiB/sec
xxhash: 365.88 MiB/sec

BeagleBlack (am33xx):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 28.20 MiB/sec
xxhash: 90.72 MiB/sec

Wandboard (i.MX6):
$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 85.02 MiB/sec
xxhash: 393.08 MiB/sec

Wow, this is some fantastic insight. It seems like the hit for switching to xxhash is pretty significant across the board, even on x86_64, because the kernel lacks hardware-optimized variants of xxhash.

With this information in mind, do we want to continue considering xxhash?

And crc32c also wins on my AMD Threadripper system:

$ fio --crctest=crc32c; fio --crctest=xxhash
crc32c: 10890.37 MiB/sec
xxhash: 8065.02 MiB/sec

although both of these numbers are at the "wow, that's fast" level, so I don't mind having picked xxhash for its longer hash length on my own system.

I'd be curious if sha256 would be fast, given that it usually has hardware acceleration, right? Or am I misremembering?

The answer is, basically, considerably slower:

fio --crctest=sha256

Carbon X1:
sha256: 60.89 MiB/sec

UpSquared:
sha256: 99.63 MiB/sec

Jetson AGX:
sha256: 170.08 MiB/sec

Honeycomb:
sha256: 137.73 MiB/sec

Rock960:
sha256: 112.98 MiB/sec

Amused that the cheap arm device is almost double the speed of the ThinkPad, and interesting that the Atom is 1.5 times it. Wonder if that's because of LUKS/VPN and other crypto on my laptop.

I would guess blake2b would see a similar performance dip, as it was a SHA-3 contestant...?

You could run it locally yourself ;-)

There's definitely HW-accelerated SHA-3 out there, but I'm not sure how widely deployed it is. I can add it to the todo list.

fio doesn't seem to have a blake2b option. But:

$ fio --crctest=crc32c; fio --crctest=xxhash; fio --crctest=sha3-224
crc32c: 10964.07 MiB/sec
xxhash: 8108.71 MiB/sec
sha3-224: 81.50 MiB/sec

Only 100x slower than xxhash on my AMD system. :)

@josef Do you know if there are any efforts going on for hardware-accelerated xxhash implementations in the kernel?

It's possible to compile hash-speedtest using --with-crypto=libkcapi, which uses the kernel crypto API instead of the built-in implementations (the default).
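
That is, something like this (requires the libkcapi development headers):

$ ./configure --with-crypto=libkcapi
$ make hash-speedtest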

Last I checked it makes a pretty big difference with sha256. But also, we're not getting hardware-accelerated sha256 with Fedora kernels, because the generic module is what gets loaded by default. I think it's because btrfs is built in while CONFIG_CRYPTO_SHA256_SSSE3=m, and the API isn't smart enough to start with generic and then switch to SSSE3 once it's available.

BLAKE was a SHA-3 finalist and ~3x faster than Keccak, which was finally selected. BLAKE2 is even faster, beating SHA-256 by quite a lot. But SHA-3 isn't supported in Btrfs, nor is XXH3, the successor to xxhash64. Note that using a checksum other than crc32c means a btrfs file system won't be mountable with a kernel older than 5.5. This isn't a big concern in Fedora land, but it might be a concern for downstreams.

xxhash64 was picked mainly to mitigate collision potential compared to that of crc32c while not being meaningfully more computationally expensive. So maybe Fedora could lead on determining a threshold for when to use xxhash by default? Collision potential increases as file system size increases, so maybe mkfs should flip to xxhash at a certain block device size? Since Btrfs trivially grows just by adding devices, the mkfs threshold should probably flip the default to xxhash well before there's some meaningful chance of collisions. Maybe 2 TiB?
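
A hypothetical sketch of what that could look like as a mkfs wrapper (the 2 TiB cutoff and the wrapper itself are illustrative only; nothing like this exists today):

#!/bin/sh
# Hypothetical wrapper: pick the btrfs csum based on block device size.
dev="$1"
threshold=$(( 2 * 1024 * 1024 * 1024 * 1024 ))   # 2 TiB cutoff (illustrative)
size=$(blockdev --getsize64 "$dev")
if [ "$size" -ge "$threshold" ]; then
    csum=xxhash
else
    csum=crc32c
fi
exec mkfs.btrfs --csum "$csum" "$dev"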

> @josef Do you know if there are any efforts going on for hardware-accelerated xxhash implementations in the kernel?

xxhash is already optimized for CPUs, utilizing instruction-level parallelism, so there's not much chance to make it better, and it's not standardized and widely used the way crc32c is, so hardware acceleration is unlikely.

Recent benchmarks:
https://github.com/kdave/btrfs-progs/issues/548#issuecomment-2058890102

And a recent update on adding more algorithms, e.g. XXH3:
https://github.com/kdave/btrfs-progs/issues/548#issuecomment-2064050115

I think it's unlikely Fedora will deviate from upstream, so I'm not sure we need to keep this issue open.

Note, there is experimental support for checksum conversion mentioned later in the thread.
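
In btrfs-progs built with experimental features enabled, that conversion looks roughly like this (it runs offline and is explicitly experimental, so back up first and check btrfstune(8) for your version):

$ umount /mnt/data
$ btrfstune --csum xxhash /dev/sdX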
