Please investigate this out of memory error on the ppc64le builder:
https://koji.fedoraproject.org/koji/taskinfo?taskID=115050608
Note the error occurs when linking:
collect2: fatal error: ld terminated with signal 9 [Killed]
and LTO is disabled, so this is not a parallelized stage of the build and we cannot reduce resource usage by requesting more RAM per vCPU.
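For triage, a build log can be checked for this failure signature with a few lines of Python. This is only a sketch: the function name is made up for illustration, and signal 9 only shows that something sent SIGKILL to the linker, which is consistent with the kernel OOM killer but does not prove it.

```python
import re

# Matches the collect2 message quoted above.
LINKER_KILLED = re.compile(
    r"collect2: fatal error: ld terminated with signal 9 \[Killed\]"
)

def linker_was_killed(log_text):
    """Return True if the log contains the 'ld terminated with signal 9' message."""
    return LINKER_KILLED.search(log_text) is not None

sample = "collect2: fatal error: ld terminated with signal 9 [Killed]\n"
print(linker_was_killed(sample))
```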
CC: @kalev
As soon as possible on or after 2024/03/18
So, the virthost that builder is on seems to have gotten somewhat stuck doing a raid check...
I cleared the check and it seems to be returning to normal now.
Of course this might not be related. Can you try a new build?
New build is successful!
Metadata Update from @zlopez: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Metadata Update from @zlopez: - Issue tagged with: low-trouble, medium-gain, ops
Metadata Update from @catanzaro: - Issue status updated to: Open (was: Closed)
This has happened again: https://koji.fedoraproject.org/koji/taskinfo?taskID=121900685
I will restart the build and hope for the best. It's the linker that's being killed, and LTO is turned off, so there is no parallelism to reduce.
I suspect this was caused by the unplanned eln mass rebuild that fired off at branching.
There's currently... 2k plus builds pending for that and so it's running a number of them on all the builders. I guess I could try taking the weight down so it tries to do less per builder, but then that will cause problems for mass rebuilds, etc.
Metadata Update from @phsmoura: - Issue priority set to: Waiting on Assignee (was: Needs Review)
Got another one: https://koji.fedoraproject.org/koji/taskinfo?taskID=122003276
I'll restart this one too. This time it died during compilation, where we can at least reduce parallelism. But we probably shouldn't have to; the current settings are already very conservative.
Trying to spot a pattern here. I guess both of these are on buildvm-ppc64le-3* builders (i.e., on the same virthost hardware)? I'll dig and see if I can find anything off with that virthost. I can also try upgrading it to the latest kernel, etc.
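One rough way to check for such a pattern is to count failures per builder group. The sketch below is an assumption-heavy illustration: the task IDs and hostnames are hypothetical placeholders (the real ones would be copied from each failed Koji task page), and it assumes the first digit of the builder number identifies the virthost, as the buildvm-ppc64le-3* glob suggests.

```python
from collections import Counter

# Hypothetical (task ID, builder hostname) pairs for illustration only;
# the real hostnames would be copied from each failed Koji task page.
failed_tasks = [
    (1, "buildvm-ppc64le-31.example.org"),
    (2, "buildvm-ppc64le-35.example.org"),
    (3, "buildvm-ppc64le-12.example.org"),
]

def builder_group(hostname):
    """Collapse e.g. 'buildvm-ppc64le-31.example.org' to 'buildvm-ppc64le-3*'.

    Assumes the first digit of the builder number identifies the virthost.
    """
    short_name = hostname.split(".")[0]
    prefix, number = short_name.rsplit("-", 1)
    return f"{prefix}-{number[0]}*"

# Count how many failed tasks landed in each builder group.
failures_per_group = Counter(builder_group(host) for _, host in failed_tasks)
for group, count in failures_per_group.most_common():
    print(group, count)
```

If one group dominates the counts, that points at the shared virthost rather than at the package being built.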
Are things looking any better now?
I did upgrade all the virthosts and builders and tried to shuffle things around some...
I haven't noticed any more WebKitGTK build failures since I reported this.
I did have a ppc64le glib2 build fail on Monday, August 26 due to a timeout when running the tests. That wasn't an OOM issue, though. The builder was just excessively slow. Restarting the build fixed it.
Another OOM build failure today: https://koji.fedoraproject.org/koji/taskinfo?taskID=122897773
Two more failures: https://koji.fedoraproject.org/koji/taskinfo?taskID=122954535 and https://koji.fedoraproject.org/koji/taskinfo?taskID=122954506
The failures occur when linking and LTO is already disabled, so it's not a parallelized build step and resource requirements cannot be further reduced.
How about we halve the number of ppc64le heavybuilders and double their memory? I won't be thrilled about waiting longer for builds, but waiting is better than running out of memory.
Humf. Well, there are only 3 of the heavybuilder ones. ;(
Also we are in freeze for beta, so I don't want to do a big reshuffling.
However, I see a way to resize another builder to be much larger and put just that one in heavybuilder. It would be only one builder, but I could give it a lot more memory, so it should at least not OOM.
I'll put in a freeze break to do that and after freeze look at more sustainable tweaking.
That sounds good, at least to handle the build emergency. Thanks.
I'll resume trying to build WebKitGTK once you've got this in place.
ok. This is now in place.
Please let me know if it helps.
Well that did solve the OOM problem, thanks!
My ppc64le builds are still failing, though:
I've never seen anything like this before:
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkit2gtk-4.1/WebKitNetworkProcess: Unit type 2 unhandled
readelf: Error: Unable to find program interpreter name
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkit2gtk-4.1/WebKitWebProcess: Unit type 2 unhandled
readelf: Error: Unable to find program interpreter name
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkit2gtk-4.1/jsc: Unit type 2 unhandled
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkitgtk-6.0/MiniBrowser: Unit type 2 unhandled
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkitgtk-6.0/WebKitNetworkProcess: Unit type 2 unhandled
readelf: Error: Unable to find program interpreter name
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkitgtk-6.0/WebKitWebProcess: Unit type 2 unhandled
readelf: Error: Unable to find program interpreter name
debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT/usr/libexec/webkitgtk-6.0/jsc: Unit type 2 unhandled
Also:
error: Empty %files file /builddir/build/BUILD/webkitgtk-2.45.92-build/webkitgtk-2.45.92/debugsourcefiles.list
I might need to ask for help on devel@ mailing list.
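For a bug report, the affected binaries can be pulled out of the build log with a small script. This is a minimal sketch: the function name is hypothetical, and the regular expression is written only against the debugedit message format seen in this log, not against any documented debugedit output format.

```python
import re

# Pattern taken from the debugedit lines in this build log.
DEBUGEDIT_ERROR = re.compile(r"debugedit: (\S+): Unit type (\d+) unhandled")

def unhandled_units(log_text):
    """Return sorted, de-duplicated (path, unit_type) pairs from the log."""
    found = {(m.group(1), int(m.group(2))) for m in DEBUGEDIT_ERROR.finditer(log_text)}
    return sorted(found)

# Two of the failing lines from the log, plus an unrelated readelf line.
sample_log = (
    "debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT"
    "/usr/libexec/webkitgtk-6.0/jsc: Unit type 2 unhandled\n"
    "readelf: Error: Unable to find program interpreter name\n"
    "debugedit: /builddir/build/BUILD/webkitgtk-2.45.92-build/BUILDROOT"
    "/usr/libexec/webkitgtk-6.0/MiniBrowser: Unit type 2 unhandled\n"
)

for path, unit_type in unhandled_units(sample_log):
    print(unit_type, path)
```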
wow... odd.
Perhaps @sharkcz would have some idea (if this is only happening on ppc64le)
The OOM problem does seem to be fixed though, so I'll close this.
I'll create a devel@ mailing list thread to ask for help if sharkcz doesn't know what's wrong.
Metadata Update from @catanzaro: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
I haven't seen such an issue yet, I will take a look. It might be something for our toolchain team ...
I have reproduced the debugedit failures and reported them as https://bugzilla.redhat.com/show_bug.cgi?id=2310828