#11839 Out of memory on ppc64le builder
Closed: Fixed a month ago by catanzaro. Opened 7 months ago by catanzaro.

Describe what you would like us to do:


Please investigate this out of memory error on the ppc64le builder:

https://koji.fedoraproject.org/koji/taskinfo?taskID=115050608

Note the error occurs when linking:

collect2: fatal error: ld terminated with signal 9 [Killed]

and LTO is disabled, so this is not a parallelized stage of the build and we cannot reduce resource usage by requesting more RAM per vCPU.
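A minimal sketch of how one might scan a build log for this failure signature. The function name `find_killed_tools` and the sample input are illustrative assumptions; only the `collect2` line is taken from the report above. Signal 9 is SIGKILL, which is what the kernel OOM killer sends, but a match alone does not prove the process was killed for memory reasons.

```python
import re

# Matches GCC driver messages of the form seen in this ticket:
#   collect2: fatal error: ld terminated with signal 9 [Killed]
KILLED_RE = re.compile(
    r"^(?P<driver>[\w+.-]+): fatal error: "
    r"(?P<tool>\S+) terminated with signal 9 \[Killed\]"
)


def find_killed_tools(log_text):
    """Return (driver, tool) pairs for every 'signal 9 [Killed]' line in the log."""
    hits = []
    for line in log_text.splitlines():
        match = KILLED_RE.match(line.strip())
        if match:
            hits.append((match.group("driver"), match.group("tool")))
    return hits


if __name__ == "__main__":
    # Hypothetical one-line log excerpt, using the message quoted in this ticket.
    sample = "collect2: fatal error: ld terminated with signal 9 [Killed]\n"
    print(find_killed_tools(sample))
```

A hit on `ld` means the linker itself was killed, which is the case where lowering build parallelism does not help.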

CC: @kalev

When do you need this to be done by? (YYYY/MM/DD)


As soon as possible on or after 2024/03/18


So, the virthost that builder is on seems to have gotten somewhat stuck doing a raid check...

I cleared the check and it seems to be returning to normal now.

Of course this might not be related. Can you try a new build?

New build is successful!

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

7 months ago

Issue status updated to: Open (was: Closed)

7 months ago

Metadata Update from @zlopez:
- Issue tagged with: low-trouble, medium-gain, ops

7 months ago

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

7 months ago

Metadata Update from @catanzaro:
- Issue status updated to: Open (was: Closed)

2 months ago

This has happened again: https://koji.fedoraproject.org/koji/taskinfo?taskID=121900685

I will restart the build and hope for the best. It's the linker that's being killed, and LTO is turned off, so there is no parallelism to reduce.

I suspect this was caused by the unplanned eln mass rebuild that fired off at branching.

There's currently... 2k plus builds pending for that and so it's running a number of them on all the builders. I guess I could try taking the weight down so it tries to do less per builder, but then that will cause problems for mass rebuilds, etc.

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

2 months ago

Got another one: https://koji.fedoraproject.org/koji/taskinfo?taskID=122003276

I'll restart this one too. This time it died during compilation, where we can at least reduce parallelism. But we probably shouldn't have to; the current settings are already very conservative.
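To illustrate why reducing parallelism helps during compilation but not during a serial link, here is a rough sketch of capping the job count by a per-job memory budget. The function name `capped_jobs` and all numbers are hypothetical and are not the actual builder or package settings.

```python
def capped_jobs(cpus, mem_total_mb, mem_per_job_mb):
    """Largest job count that fits both the CPU count and the memory budget.

    Never returns less than 1: a build cannot run with zero jobs, which is
    why a single process that needs more memory than is available (such as
    a serial link step) cannot be helped by lowering parallelism.
    """
    if cpus < 1 or mem_total_mb < 0 or mem_per_job_mb <= 0:
        raise ValueError("cpus must be >= 1 and memory values must be positive")
    by_memory = mem_total_mb // mem_per_job_mb
    return max(1, min(cpus, by_memory))


if __name__ == "__main__":
    # Hypothetical figures only: 8 vCPUs, 16 GiB RAM, 4 GiB budget per job.
    print(capped_jobs(8, 16384, 4096))  # memory is the limit here
    # A single job that needs more than the total still yields 1 job.
    print(capped_jobs(8, 2048, 4096))
```

In the second call the cap bottoms out at one job, matching the linker case described earlier in this ticket: the only remaining fix is more memory on the builder.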

Trying to spot a pattern here. I guess both of these are on buildvm-ppc64le-3* builders? (i.e., on the same virthost hardware).
I'll dig and see if I can see anything off with that virthost.
I can also try upgrading it to the latest kernel, etc.

Are things looking any better now?

I did upgrade all the virthosts and builders and tried to shuffle things around some...

I haven't noticed any more WebKitGTK build failures since I reported this.

I did have a ppc64le glib2 build fail on Monday, August 26 due to a timeout when running the tests. That wasn't an OOM issue, though. The builder was just excessively slow. Restarting the build fixed it.

Two more failures: https://koji.fedoraproject.org/koji/taskinfo?taskID=122954535 and https://koji.fedoraproject.org/koji/taskinfo?taskID=122954506

The failures occur when linking and LTO is already disabled, so it's not a parallelized build step and resource requirements cannot be further reduced.

How about we halve the number of ppc64le heavybuilders and double their memory? I won't be thrilled about waiting longer for builds, but waiting is better than running out of memory.

Humf. Well, there are only 3 of the heavybuilder ones. ;(

Also we are in freeze for beta, so I don't want to do a big reshuffling.

However, I see a way I could resize another one to be much larger, and could just put that one in heavybuilder. It would only be one builder, but I could give it a bunch more memory so it at least shouldn't OOM.

I'll put in a freeze break to do that and after freeze look at more sustainable tweaking.

That sounds good, at least to handle the build emergency. Thanks.

I'll resume trying to build WebKitGTK once you've got this in place.

ok. This is now in place.

Please let me know if it helps.

Well that did solve the OOM problem, thanks!

My ppc64le builds are still failing, though:

I've never seen anything like this before:

debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkit2gtk-4.1/WebKitNetworkProcess: Unit type 2 unhandled
readelf: Error: Unable to find program interpreter name
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkit2gtk-4.1/WebKitWebProcess: Unit type 2 unhandled
readelf: Error: Unable to find program interpreter name
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkit2gtk-4.1/jsc: Unit type 2 unhandled
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkitgtk-6.0/MiniBrowser: Unit type 2 unhandled
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkitgtk-6.0/WebKitNetworkProcess: Unit type 2 unhandled
readelf: Error: Unable to find program interpreter name
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkitgtk-6.0/WebKitWebProcess: Unit type 2 unhandled
readelf: Error: Unable to find program interpreter name
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkitgtk-6.0/jsc: Unit type 2 unhandled

Also:

error: Empty %files file /builddir/build/BUILD/webkitgtk-2.45.92-build/webkitgtk-2.45.92/debugsourcefiles.list

I might need to ask for help on devel@ mailing list.

wow... odd.

Perhaps @sharkcz would have some idea (if this is only happening on ppc64le)

The OOM problem does seem to be fixed though, so I'll close this.

I'll create a devel@ mailing list thread to ask for help if sharkcz doesn't know what's wrong.

Metadata Update from @catanzaro:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a month ago

I haven't seen such an issue yet; I will take a look. It might be something for our toolchain team ...

I have reproduced the debugedit failures and reported them as https://bugzilla.redhat.com/show_bug.cgi?id=2310828