#12377 Build restarting on s390x builder
Opened 5 months ago by catanzaro. Modified a month ago

Describe what you would like us to do:


Please investigate this build which is restarting on s390x (presumably out of memory): https://koji.fedoraproject.org/koji/taskinfo?taskID=128394527

When do you need this to be done by? (YYYY/MM/DD)


Preferably 2025/01/27


Ugh.

So, some of the s390x builders (16-21) have fewer CPUs and less memory. I wonder if it hit one of those for the first cycle(s).

it's on 03 now, which has more memory and cpus.
but it's been building for like 10 hours. ;(

I guess if it fails this time I will scrap some smaller builders and build up a larger one. ;(

Presumably the smaller builders should just not be assigned to the heavybuilder channel, right?

And looking... they aren't, so it's the 'larger' ones that are no longer able to build it. ;(

So, I scrapped 21, added its memory to 20, and made that one the only one in the heavybuilder channel.

In fact it failed right around then and restarted and moved to 20. ;)

I guess let's see...

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: high-gain, medium-trouble, ops

5 months ago

Build succeeded (after 78 hours :) so I think that worked. Thanks!

With only one heavybuilder there is generally going to be a long queue for WebKitGTK and Chromium builds, but I certainly do prefer waiting to build vs. unreliable or restarting builds.

Unfortunately I see builds restarting again:

I see the builder has 3 vCPUs and about 44 GB of RAM, so roughly 15 GB of RAM per vCPU, which should be far more RAM than required. But I think it might be running two jobs at once? Even so, it should still have more than enough RAM/vCPU to compile WebKitGTK. It might not be enough to link two WebKitGTKs at the same time, though? Linking requires a huge amount of RAM even without any parallelism.
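As a rough sanity check on that arithmetic, here is a minimal sketch using only the figures quoted above (3 vCPUs, about 44 GB of RAM). The two-concurrent-jobs case is the guess from this comment, not something confirmed, and the function names are just illustrative.

```python
def ram_per_vcpu(total_ram_gb: float, vcpus: int) -> float:
    """RAM available per vCPU if memory were split evenly."""
    return total_ram_gb / vcpus


def ram_per_job(total_ram_gb: float, concurrent_jobs: int) -> float:
    """RAM available per job if memory were split evenly between jobs."""
    return total_ram_gb / concurrent_jobs


# Figures quoted in the comment above: 3 vCPUs, about 44 GB of RAM.
per_vcpu = ram_per_vcpu(44, 3)

# Assumption: the builder runs two jobs at once (unconfirmed guess).
per_job = ram_per_job(44, 2)

print(f"{per_vcpu:.1f} GB per vCPU")
print(f"{per_job:.1f} GB per job with two concurrent jobs")
```

So if two WebKitGTK builds reach the link step at the same time, each would have roughly half the memory to work with, and this does not account for whatever the builder's OS and other processes use.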

The problem is that the host is OOM killing the vm.

So, I need to try and figure out why that's happening. ;( I might need to drop another 'normal' vm to make sure there's enough free memory for it.

I guess this is okish? I have seen the vm killed, but not often. I could try reducing its memory slightly, or dropping another builder. ;(

I think it's time to drop at least one more builder.

We might need to consider limiting s390x to ELN only if we can't get more resources.

Well, that's not something that would be decided in this ticket.

I'll try and see if I can rebalance things again.

So, I guess you are seeing this still/again/more?

I think I've seen a couple recent restarts on s390x specifically, but it's not nearly so bad as it was before. If a build had gotten stuck in a restart loop and wasn't making any progress, then I would have said something here. Builds are mostly stable now. But I think s390x will still occasionally restart and so wind up building twice before finishing, which is not ideal. I'm not certain, though.

So my suggestion is to rebalance a little, but not a lot.

Another problem is that now if an SRPM build gets assigned to an s390x builder, the SRPM build will be blocked until everything else is finished, so s390x may block all other architectures. But this might not be worth spending any time investigating, since it only slows down subtasks and does not affect the speed of the actual overall build (because s390x is the slowest architecture anyway).


Metadata
Boards 1
ops Status: Backlog