#632 [F33] SwapOnZRAM Test Day
Closed: Fixed 3 years ago by chrismurphy. Opened 3 years ago by chrismurphy.

Change proposal: Set up a ZRAM device, then mkswap and swapon during early boot. All editions and spins; and hopefully also including upgrades. New default installs would have no swap-on-disk. New custom installs would have two swaps, with the ZRAM one being higher priority.

RHBZ tracker bug


Hey Chris, it's a good idea to have a test day for this. @sumantrom will help you organize it.

Metadata Update from @kparal:
- Issue assigned to sumantrom
- Issue set to the milestone: Fedora 33
- Issue tagged with: test days

3 years ago

I don't, but @zbyszek @catanzaro might have suggestions. I'm fairly flexible.

Any day is OK for me, I just need to know a few days in advance.

The sooner the better. I'm open to suggestions on the proper sequence. @bcotton ?

  • change lands on devel@
  • test day
  • fesco

Is June 10 too soon?

@chrismurphy June 8th is CoreOS test day. I would like to have it on the 12th or 15th if that's not too late.
Also we need to chalk out test cases.

Either 12th or 15th work for me; I need to defer to @catanzaro because he's going to bring in a bunch of GNOME folks too, and Friday vs Monday might matter on turn out.

@chrismurphy that's great.
Please let me know and I will create a wiki page and ping you and @catanzaro for test cases.

The sooner the better. I'm open to suggestions on the proper sequence. @bcotton ?

I don't think it's necessary to have the test day prior to FESCo submission, so the ordering doesn't particularly matter too much. I'll announce the change proposal now.

Either 12th or 15th work for me;

The 12th is not a good choice because Red Hatters will be enjoying an internal holiday... let's use the 15th.

I need to defer to @catanzaro because he's going to bring in a bunch of GNOME folks too, and Friday vs Monday might matter on turn out.

This is news to me! I didn't have plans to do more than test it out myself. I can try to round up some people, of course, but I wouldn't expect a crowd.

I remember this being a more grandiose suggestion than is reflected in the minutes. What I recall is something akin to a GNOME test day that would get broader coverage than just Fedora QA folks who perhaps have more idealized setups.
15:25:43 <aday_> Michael: suggests advertising the test day to ensure wider participation

Well of course we should advertise the test day. I would ask for participants on devel@, which is likely to result in at least a couple volunteers, I hope.

@catanzaro @chrismurphy I tried to sign in to the wiki and it has been returning a 504 since last night. I think this is due to the DC move. Do you guys want to hold the test day until we get things back?

We have to postpone the testday, we can't function without the wiki. Sumantro, have any announcements already been sent out?
Let's try the wiki login on Monday, and if it works, we can do the test day e.g. on Jun 18th? If the wiki is still broken, we'll need to wait further.

@chrismurphy the test day page is https://fedoraproject.org/wiki/Test_Day:F33_SwapOnZRAM
The "How to test" and test day prerequisites sections are still empty, and I will fill them in. If you are around, feel free to edit and fill in.
For the test cases, I will need a bunch of use cases where a user might be testing for something specific.
@kparal thanks a ton for filing the infra issue.

Looks like the wiki issue is fixed. Shall we schedule for June 18?

We don't have the test cases yet. I would love to have the test cases complete and then schedule a date. The 18th and 19th both sound great.
@kparal any thoughts? help? insights?

Reminder: I am not a scientific sample, and my test cases aren't everyone's workload. If I suggest my tests too strongly, literally I will be skewing the test results.

The difficulty with the test case instructions is how to convey to testers that they should basically try to do workloads that they know are in the realm of 90-110% of their system's capacity. We want to make sure stressed workloads they expect to work, still work.

zram and suspend (S3) don't interact but it's useful to hammer on suspend tests just to build confidence in this regard.

It might be true that some workloads users avoid, due to lack of resources, will work better. But what we want to discover is whether they are at least not worse. And if they are worse, we want to ask them to retest with the zram device at 75% RAM (i.e. increase it).

For example: I do a lot of tests building the linux kernel, and also webkitgtk. The former test is perhaps not stressful enough memory wise, it's very CPU dominant. Whereas the latter test is flat out underprovisioned on most setups, using defaults. It's great for torture testing.

But torture testing isn't really the goal here. Instead it's to ask users to "do the things you'd normally do that really stresses your system" but they're going to do it with zram based swap instead of disk based swap and see if it exposes some new/unexpected behavior.

I'm a power user and I have no idea which regular workload I'd use to test this. I don't think I ever use all 16GB RAM I have, I probably never swap. I expect most people will similarly have no idea. We can only target people who do know about such a workload (and know it's memory-related and not CPU or IO related, which is another big assumption), but I'm not sure how many there are. So I guess some artificial workload is needed for all the rest. For example, use this command to occupy 80% of your RAM, and then open the web browser and open up a lot of pages, try switching between them or to different applications. Then set up zram and repeat the same steps, did that feel better? Another option is to target users of low RAM devices, like ARM boards, netbooks or very old PCs, where users know that they can't afford to have a web browser and GIMP opened at the same time, etc. I agree that coming up with some good test cases is going to be challenging (and it needs to happen today and tomorrow, if we target Thursday).
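The "occupy 80% of your RAM" idea needs a number to aim for. A rough sketch of how that number could be derived from /proc/meminfo follows. The function names and the sample values are made up for illustration, and nothing here allocates memory; it only computes how much a hog would need to allocate.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of byte counts."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        # Most fields carry a "kB" suffix, which the kernel uses to mean KiB.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        fields[name.strip()] = value
    return fields


def bytes_to_fill(meminfo, target_percent=80):
    """Bytes a memory hog would need to allocate to reach target_percent of RAM in use."""
    total = meminfo["MemTotal"]
    in_use = total - meminfo["MemAvailable"]
    return max(0, total * target_percent // 100 - in_use)


# Illustrative sample only; on a real system read open("/proc/meminfo").read().
SAMPLE_MEMINFO = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:   12000000 kB
"""

if __name__ == "__main__":
    print(bytes_to_fill(parse_meminfo(SAMPLE_MEMINFO)))
```

Integer percent is used instead of a float fraction so the result is exact for large byte counts.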

We can also have a few technical test cases, like "enable zram, did it work?" and "enable zram with disk swap present, is it higher priority?", but the latter is still unimplemented, IIUIC, so the number of these technical test cases is probably going to be low.

Any synthetic test to hog memory ends up altering the compression ratio, unrealistically. And using, e.g. mem=4G, also substantially changes the local configuration away from "current" and thus working as good or better with the user's normal workload.
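To illustrate why a synthetic hog skews the compression ratio, here is a small sketch comparing how well all-zero pages and random pages compress. It uses zlib from the Python standard library purely as a stand-in; zram uses its own kernel compressors, so the absolute numbers are not comparable, only the contrast between the two kinds of data.

```python
import os
import zlib

PAGE_SIZE = 4096   # typical page size; an assumption for this sketch
PAGE_COUNT = 256   # 1 MiB of test data


def compression_ratio(data):
    """Uncompressed size divided by compressed size (higher means more compressible)."""
    return len(data) / len(zlib.compress(data))


# What a naive hog produces: untouched or zero-filled memory.
zero_pages = bytes(PAGE_SIZE * PAGE_COUNT)
# The opposite extreme: data that does not compress at all.
random_pages = os.urandom(PAGE_SIZE * PAGE_COUNT)

if __name__ == "__main__":
    print("zero-filled:", round(compression_ratio(zero_pages), 1))
    print("random:     ", round(compression_ratio(random_pages), 3))
```

Real workloads land somewhere between these two extremes, which is why neither kind of synthetic fill says much about how a user's normal workload will behave.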

We might consider pushing this back to get more expected things from upstream, since some behaviors are going to change regarding how it will be enabled/disabled, and whether it will be possible in the initial release to set a cap on the size of the zram device.

@zbyszek @ignatenkobrain would it help to have a ~ 1 week reprieve on those change before doing the test day?

I think it makes sense. I don't think there's enough time to properly announce a test day today (it's after midnight here) or tomorrow. We're working on a new upstream release. I hoped it would be done by now, but it's taking longer than expected. Also, there are some selinux denials (also with the version currently in Fedora), and it would be good to update the policy to avoid them before the test day. A week of delay sounds good.

OK, there are now new builds I need to test, and then I can update the test case to confirm it's working; and yeah, some test case that's more clear than "ok guys do some stuff and try to break it".

On the packaging side, everything seems to be OK. The new package is in F32 and there are no known outstanding bugs.

some test case that's more clear than "ok guys do some stuff and try to break it".

What about a list of different srpms that use various amounts of memory to build? We could ask people to rebuild a package in koji with requirements close to 100% of their RAM while using the browser, and see if things don't go south. This would be fairly easy to execute, but at the same time the differences in machine speed and parallel building would mean that on different machines this would not be entirely repeatable.

It's difficult because the reference for "works as good or better" is the user's system+workload+conventional swap. And the only way I got to certainty on that with webkitgtk building was 100+ tests (crazy) because it sometimes does get bad but it's not worse than a conventional swap partition. Also, for F32 users the webkitgtk compile will either succeed or end with earlyoom killing it, depending on how much memory they have. But a lot of those might be useful especially if it uncovers an edge case that I haven't.

If either of you can write a quick and dirty set of steps (unformatted and put in this issue is fine) how users would do that, I'll test it and write up the actual test case. I never use koji or Fedora packages to do this test myself, I only ever use upstream code and use the first two commands here.

OK I'm thinking of one big test case.
1. Confirm swap on zram is disabled: sudo systemctl stop swap-create@zram0 (the user should be in their normal configuration).
2. Run three "rebuild a package in koji" builds and time them (can we use the time command to help make it easier for the user?), i.e. they should have three times, one for each rebuild of 3 packages. This is their baseline.
3. Enable zram with defaults: sudo systemctl start swap-create@zram0
4. Repeat the three "rebuild a package in koji" builds and time them, then compare.

Times should be the same or better, inclusive of earlyoom triggering. We might in fact see more earlyoom triggers where before people would hang for a while with disk-based-swap soaking their system.
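One possible way to make the timing step easier for testers is a small wrapper that runs a command a few times and records wall-clock time and exit status. This is only a sketch: the helper names are invented here, and the placeholder command stands in for whatever rebuild command the final test case ends up using.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once; return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    result = subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start, result.returncode


def time_runs(argv, runs=3):
    """Run argv several times, one after another, and collect the timings."""
    return [time_command(argv) for _ in range(runs)]


# Placeholder workload so the sketch runs anywhere; replace with the real build command.
PLACEHOLDER_COMMAND = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    for elapsed, code in time_runs(PLACEHOLDER_COMMAND):
        print(f"{elapsed:.2f}s exit={code}")
```

A non-zero exit status is recorded rather than raised, since a build killed by earlyoom is itself a result worth reporting.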

Also maybe there can be a column to subjectively grade system responsiveness during the builds? Grades 1-3 are sufficient I think: responsive, semi-responsive, not very responsive.

Basic sanity test cases I plan to get into wikis tomorrow.

test installation zram-generator only (default disabled)
confirm: swapon shows no /dev/zram0 device

test installation zram-generator-defaults (default enabled)
confirm: /dev/zram0 has a size of 50% RAM or 4G whichever is smaller, priority 100
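For checking the "50% RAM or 4G whichever is smaller" rule by hand, the expected size can be worked out as below. This is a minimal sketch of the arithmetic only; the function and parameter names are invented for illustration and are not zram-generator's own.

```python
GIB = 1024 ** 3


def expected_zram_size(ram_bytes, fraction_percent=50, cap_bytes=4 * GIB):
    """Expected zram device size: a share of RAM, limited by a cap."""
    return min(ram_bytes * fraction_percent // 100, cap_bytes)


if __name__ == "__main__":
    for ram_gib in (2, 8, 16):
        size = expected_zram_size(ram_gib * GIB)
        print(f"{ram_gib} GiB RAM -> {size / GIB:g} GiB zram")
```

The same helper covers the custom-config case by passing a different fraction or cap.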

test disable via empty config file in /etc
touch /etc/systemd/zram-generator.conf
confirm: swapon shows no /dev/zram0 device

test custom config
nano /etc/systemd/zram-generator.conf
options to test: fraction, size cap
confirm: swapon shows /dev/zram0 device size as expected
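The "confirm" steps above could also be checked from a script by reading /proc/swaps, which lists the same active swap devices that swapon shows. The following is a sketch under that assumption; the helper names and the sample listing are illustrative, and the size tolerance is an arbitrary choice to absorb the small header overhead of a swap area.

```python
def parse_proc_swaps(text):
    """Parse /proc/swaps-style text into a list of dicts (sizes in KiB)."""
    entries = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        entries.append({
            "device": fields[0],
            "type": fields[1],
            "size_kib": int(fields[2]),
            "used_kib": int(fields[3]),
            "priority": int(fields[4]),
        })
    return entries


def find_device(entries, device="/dev/zram0"):
    """Return the entry for device, or None if it is not an active swap."""
    for entry in entries:
        if entry["device"] == device:
            return entry
    return None


def size_matches(entry, expected_bytes, tolerance_bytes=1024 * 1024):
    """True if the reported swap size is within tolerance of the expected size."""
    return abs(entry["size_kib"] * 1024 - expected_bytes) <= tolerance_bytes


# Illustrative sample; on a real system read open("/proc/swaps").read().
SAMPLE_SWAPS = """Filename                                Type            Size            Used            Priority
/dev/dm-1                               partition       8388604         0               -2
/dev/zram0                              partition       4194300         0               100
"""

if __name__ == "__main__":
    zram = find_device(parse_proc_swaps(SAMPLE_SWAPS))
    print("zram0 active:", zram is not None)
    if zram is not None:
        print("priority:", zram["priority"])
```

With both a disk swap and zram present, comparing the priority fields also answers whether zram is the higher-priority device.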

So five test cases, with one as a big 3-in-1 done twice: baseline without zram and then with zram. 11 tests total. :D too much? not enough? just right?

I'm not planning to ask to disable disk based swap. I think it's useful information to gather with it enabled. Even if we don't apply this to upgrades (looking not likely), the current behavior with custom installs where the user manually creates swap means they get both. So we still want to test both.

OK I'm thinking of one big test case.
1. confirm swap on zram disabled sudo systemctl stop swap-create@zram0 user should be in their normal configuration
2. run three "rebuild a package in koji" and time them (can we use the time command to help make it easier for the user?) i.e. they should have three times, one for each rebuild of 3 pages. This is their baseline.

s/koji/mock/.

I wouldn't ask users to do this by default. I would label this as an "advanced test" and only ask people to do it if they feel that the machine behaves worse with zram enabled.

  1. enable zram with defaults sudo systemctl start swap-create@zram0
  2. repeat the three "rebuild a package in koji" and time them, compare.

I think that as long as the package builds fine, we're OK.

The problem with comparing numbers is that depending on how exactly the workload fits available memory, the result might be better or worse than without zram, and the fraction is not meaningful to us. (If the workload matches available RAM closely but actually fits, enabling zram will cause a minor slowdown by increasing CPU usage. OTOH, for workloads that exceed available RAM moderately, enabling zram should give a reasonable speedup. I expect the curve comparing build time w/ zram to build time w/o zram as function of workload size to be flat with a value of 1.0, then go up a bit when we get close to RAM limit, and then down below 1.0 when zram effectively extends available RAM, also extending into the region where build w/o zram will fail with oom, but succeed with zram. But with a single user we can't really say where we are on this curve without running workloads with different sizes.) So as long as it doesn't crash or behave visibly worse than w/o zram, we can treat the test as passed.
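The pass rule described here could be written down roughly as follows. This is a sketch, not an agreed criterion: the verdict names and the slowdown tolerance standing in for "visibly worse" are assumptions made for illustration. A time of None means the build did not finish (for example, it was killed by oom).

```python
def verdict(baseline_seconds, zram_seconds, slowdown_tolerance=1.25):
    """Classify one build pair; None means that build did not complete."""
    if zram_seconds is None:
        # Failing only with zram enabled is the regression we are looking for.
        return "inconclusive" if baseline_seconds is None else "fail"
    if baseline_seconds is None:
        # zram let a build finish that previously failed.
        return "pass"
    ratio = zram_seconds / baseline_seconds
    # The exact ratio is not meaningful; only a clearly visible slowdown is flagged.
    return "pass" if ratio <= slowdown_tolerance else "investigate"


if __name__ == "__main__":
    print(verdict(600, 630))
    print(verdict(None, 700))
    print(verdict(600, None))
```

A small slowdown still passes, which matches the expectation that workloads just fitting in RAM get slightly slower with zram enabled.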

@sumantrom Let me know if this is a good enough start on the test cases to schedule the test day?

@zbyszek
I just need 1 or 2 more ideas for what to have folks rebuild, in addition to webkit2gtk3. I know from experience webkitgtk will oom with 8 cores and less than ~18G RAM.

@chrismurphy
I am going through the test cases. Let's schedule the test day for the start of next week or the end of this week (the 3rd). Does that work for you, or do you want it on Thursday?

The test cases look good. I think the most likely failure mode is selinux issues.

Thursday is good for me.

Setting up test day bits, gimme like 30 mins

@chrismurphy the event page is here https://testdays.fedorainfracloud.org/events/86
I am going to announce the test day on the @test and @test-announce lists and the community blog for Monday 2020-07-06.
Sounds good?

Is this expected to work on an encrypted disk?

It's supposed to work with any setup which already has a working hibernation, right cmurf? We might want to state this in the test case.

Indeed hibernation doesn't work in my setup, regardless of swap on zram :-)

It's supposed to work with any setup which already has a working hibernation, right cmurf? We might want to state this in the test case.

Agreed.

I have adjusted test case instructions and renamed the test cases a bit.

Packages we want users to test hopefully will make it to stable by the start of the test day.
https://bodhi.fedoraproject.org/updates/?search=zram-generator

These are the meta packages. The actual packages omit 'rust' - so the name used in the test cases for installation are correct.
rust-zram-generator-0.2.0-1.fc31
rust-zram-generator-0.2.0-1.fc32
rust-zram-generator-0.2.0-1.fc33

New test case added to test day page; but it's not yet on the results page (I'm not sure how to do that).
https://fedoraproject.org/wiki/QA:Testcase_SwapOnZRAM_buildwkgtk

I've updated the hibernation wiki (maybe another one too), so that could use a formatting and sanity check to make sure I didn't foul it up.

@chrismurphy I have checked the hibernation and the buildwkgtk test cases, they are added to the results page https://testdays.fedorainfracloud.org/events/86

@chrismurphy this test day went successfully. Unless you have anything else to add, can we close this ticket as fixed?

Metadata Update from @chrismurphy:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago
