#10910 Builds should not restart if they hit OOM
Closed: Fixed with Explanation 2 days ago by jnsamyak. Opened 2 years ago by catanzaro.

When a build hits OOM, it should just fail so the packager knows there is a problem. Recently, builds have begun restarting, only to fail again, restart, fail again, etc. This is unhelpful as it slows things down. If the build hit OOM once, it is going to do so again: it should not be retried.

The impact: packagers get confused waiting ages for builds to complete, because the build looks like it is still in progress. Only if you pay close attention do you notice that the subtasks keep restarting.


Things to answer:
1. Can koji do this already or does it need a feature request upstream?

I respectfully disagree with the premise here that this is always wrong. Retrying is the only way we were able to build Pythons on arm32. If the build simply failed, we would need to resubmit it again and again, and we would waste resources building the other architectures needlessly.

Metadata Update from @phsmoura:
- Issue tagged with: low-gain, low-trouble, ops

2 years ago

So, I think there's a misunderstanding of what's happening here.

The builder takes the job, starts building it, hits an OOM condition and... since the kernel is looking at cgroups, it kills the memory hog: kojid.

Before, that would just mean the builder was dead in the water until someone came and restarted kojid there.
Then we set it to autorestart on failure, so now kojid is killed, systemd restarts it, and... sometimes it gets into a loop.

So, I am unsure how we can improve things. Perhaps mock could run builds in a separate cgroup? But then we would still need to decide whether that's a fatal error or a retry event.
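
For illustration, bounding the loop on the systemd side would look something like the drop-in below. The path and numbers are guesses rather than our actual config, and note that hitting the limit just leaves the builder down (the old dead-in-the-water behaviour) instead of failing the build:

    # /etc/systemd/system/kojid.service.d/limit-restarts.conf
    [Unit]
    # Give up after 3 failed starts within 10 minutes instead of looping.
    StartLimitIntervalSec=600
    StartLimitBurst=3

    [Service]
    Restart=on-failure
    RestartSec=60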

All that said, we have had systemd-oomd enabled since my last builder reinstall, and I think it's overeager. I have disabled it.
Can we see if doing that helps any?
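
For the record, the disabling amounts to the usual systemctl invocation:

    # Stop systemd-oomd now and keep it from starting at boot.
    systemctl disable --now systemd-oomd.service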

> All that said, we have had systemd-oomd enabled since my last builder reinstall, and I think it's overeager. I have disabled it.
> Can we see if doing that helps any?

I'm not sure if it's relevant, but one of my builds https://koji.fedoraproject.org/koji/taskinfo?taskID=89692887 restarted 16 minutes ago, whereas your comment was 26 minutes ago. So that's probably a bad sign?

Worse, in that build I increased the %limit_build from 2 GiB RAM per vCPU up to 3 GiB RAM per vCPU. I guess I could try 4 GiB per vCPU, but... that's concerning. Even WebKit shouldn't need that much. :/
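
For context, %limit_build is the redhat-rpm-config macro that caps build parallelism so each job gets a minimum amount of RAM; my spec uses it roughly like this, with 3072 MB being the 3 GiB per vCPU mentioned above:

    # In %build, before the parallel make: require >= 3 GiB per job.
    %build
    %limit_build -m 3072
    %make_build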

> I respectfully disagree with the premise here that this is always wrong. Retrying is the only way we were able to build Pythons on arm32. If the build simply failed, we would need to resubmit it again and again, and we would waste resources building the other architectures needlessly.

Doesn't that seem... pretty awful?

It is pretty awful, yes :(

So, this is happening on x86 it seems?

I can give the buildvm-x86 VMs more memory pretty easily, though it will require rebooting them. I can bump them from 15 GB to 24 GB.

Do you know which builder it was on before it restarted? I can look at the logs for the OOM messages if I know which one it was on when the OOM hit.

> So, this is happening on x86 it seems?

We also had trouble with s390x earlier today, see https://koji.fedoraproject.org/koji/taskinfo?taskID=89676287

OK. I gave all the buildvm-x86 VMs 24 GB of memory (instead of the 15 GB they had before).

Let's see if that, plus oomd being disabled, gets us to a stable place.

It looks like x86_64 has stabilized... at least, my build succeeded. Hopefully that wasn't just luck.

Unfortunately, my s390x build restarted after 10 hours: https://koji.fedoraproject.org/koji/taskinfo?taskID=89692890. Ideally koji would "poison" the build to ensure it is only ever started once, so we get an error if it fails for any reason. The restart loop is not useful.

> Unfortunately, my s390x build restarted after 10 hours: https://koji.fedoraproject.org/koji/taskinfo?taskID=89692890. Ideally koji would "poison" the build to ensure it is only ever started once, so we get an error if it fails for any reason. The restart loop is not useful.

This build is still running 55 hours after I started the job. So clearly jobs are still restarting. This is not an efficient use of our resources. A notice that the build has failed would be much more useful than multi-day build limbo.

Maybe the s390x builder needs a little more RAM? I don't know, because the previous job appears to disappear from koji's web UI when the new job is started: I only see the latest job, which has not yet failed.

@catanzaro

  1. there is a mass rebuild of all of Fedora 37 going on currently. All the builders are going to be slower, and there are not going to be any changes to the infrastructure until that is complete.
  2. while the restart loop is not useful for you, it is useful because Fedora Infrastructure has very limited resources which are going to fail for multiple odd reasons all the time:
    a. there are only N available virtual builders for each architecture and no room to add more.
    b. there is currently only 1 main sysadmin and 1 release engineer, and they are the same person due to multiple conflicting issues. Having autorestarts means they can try to do both jobs a little, versus no jobs at all because they would be dealing with 50 daily tickets about failed builds with no time for root cause analysis.

Yes, this is clearly not optimal and needs improvement, but improving it will take a lot of effort and time.

Sigh, OK. But note that (a) this is not how things worked until recently: we used to just see normal build failures for OOM, and things worked fine; and (b) it's going to cause major trouble not just for WebKitGTK, but also for other large C++ projects: LibreOffice, Inkscape, Chromium, Firefox, etc.

Regarding the ongoing s390x build loop, which is now up to 70 hours, I suppose we can discuss that in #10909.

> Sigh, OK. But note that (a) this is not how things worked until recently: we used to just see normal build failures for OOM, and things worked fine; and (b) it's going to cause major trouble not just for WebKitGTK, but also for other large C++ projects: LibreOffice, Inkscape, Chromium, Firefox, etc.

That's not true. It's done this since at least 2020:

f2fd9f897c5 (Kevin Fenzi 2020-09-01 10:19:18 -0700 9) Restart=on-failure

Anyhow, I am not sure how we could even change this behavior. If kojid is OOM killed, it can't do anything about failing the build. I suppose it could look for OOM messages in the logs when it starts, but a past OOM doesn't mean the current build would hit one.
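
To sketch what that look-at-the-logs idea would mean (purely hypothetical - kojid does nothing like this today, and check_prior_oom is a made-up name):

    # Hypothetical startup check: did the kernel recently OOM-kill kojid?
    # Even a hit only proves a PAST build hit OOM, not that the task being
    # retried was the victim or would fail again.
    import subprocess

    def check_prior_oom(minutes=10):
        out = subprocess.run(
            ["journalctl", "-k", "--since", "-%dmin" % minutes, "--no-pager"],
            capture_output=True, text=True,
        ).stdout
        return any("oom" in line.lower() and "kojid" in line
                   for line in out.splitlines())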

@tkopecek any thoughts on this issue?

Generally, hitting OOM once doesn't mean that the build will fail every time. There could be a bunch of reasons. The most frequent is that several "big" builds run at the same time on the same machine without respecting each other (which is not bad behaviour in itself, but happens more and more with current software). When a build is restarted it can:
a) end up on the same builder but no longer competing for resources
b) end up on the same builder and fail again due to a parallel build
c) end up on another builder with different resources, builds, etc. - a completely different starting point
d) fail every time - which is the case we would like to detect, but that seems almost impossible to me.

Possible things to do:
1) This or next week @mikem should finish the new scheduler design - it will allow "reserving" resources, so devs (or better, rel-eng) will be able to tweak memory/disk/instruction-set/... requirements per tag/package. Builds that fail inside those fences shouldn't be retried. Anyway, getting it production-ready will take two or more releases.
2) Short-term - mock's nspawn backend already runs in a different cgroup, so it shouldn't kill kojid, but as you've written it will still result in the task restarting forever.
3) I've created two hackish solutions for similar cases:
  a) a "beefy" channel whose builders have capacity 1.4, meaning only one build at a time can run there (see the sketch after this list). It wastes computing resources, but we don't have to waste human resources debugging whether a parallel build exhausted the memory (meson is a specialist there; webkit and company are similar).
  b) a koji patch that increases the weight of buildArch for specific packages to 51% of a builder's capacity, for builds that can't run on the same machine due to enormous disk usage (some kernel flavours). It differs from the beefy channel in that other builds can still run alongside.
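
For reference, the capacity half of (3a) is just a host setting in koji; the host name here is made up:

    # A capacity of 1.4 leaves room for only one typical build task at a
    # time, so nothing else competes for the builder's memory.
    koji edit-host buildhw-x86-01.example.org --capacity 1.4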

Yeah, it's an annoyingly difficult problem.

I really don't like separate heavy-build channels, because it means those resources either sit idle a lot of the time or get overwhelmed by builds. I.e., if you have 2 big builders in a channel that's great, but if there are, say, 3 webkitgtk builds, a chromium build, and 2 libreoffice builds, the ones that arrive later just have to wait for a builder to free up, making things take a long time.

Anyhow, I would say we should close this and see if the new scheduler helps out...

Closing this as per the last comment and my reading of the discussion above; if anything is still needed, please feel free to reopen this ticket!

Metadata Update from @jnsamyak:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

2 days ago
