#11618 Build restarting on ppc64le heavybuilder
Closed: Fixed with Explanation 9 months ago by catanzaro. Opened 11 months ago by catanzaro.

Describe what you would like us to do:


The build https://koji.fedoraproject.org/koji/taskinfo?taskID=108905303 is restarting over and over. I think I can only see logs for the build attempt currently in progress, not for the restarted ones, but in the past this behavior has indicated an out-of-memory condition.

Looking at hw_info.log I see:

Memory:
               total        used        free      shared  buff/cache   available
Mem:        20124608     1239040     3107712        4096    15777856    18735104
Swap:        8388544      239424     8149120

I think this means only 3107712 kB (roughly 3 GiB) of RAM is free, which is just not enough.
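For reference, the kB figures from the hw_info.log output above convert to GiB like this (a quick awk one-liner over the quoted numbers):

```shell
# Convert the free(1) figures above from kB to GiB (1 GiB = 1048576 kB).
awk 'BEGIN {
    printf "total: %.1f GiB, free: %.1f GiB, buff/cache: %.1f GiB\n",
        20124608/1048576, 3107712/1048576, 15777856/1048576
}'
# prints: total: 19.2 GiB, free: 3.0 GiB, buff/cache: 15.0 GiB
```

So the builder has about 19 GiB total, of which only ~3 GiB is in the "free" column and ~15 GiB is sitting in page cache.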

When do you need this to be done by? (YYYY/MM/DD)


2023/11/13 or when possible


Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: koji, low-gain, low-trouble, ops

11 months ago

Did you do something to fix this? The build completed about 8 hours after I reported this issue.

Total time 40:01:54

Task time 6:16:31

(That is, 34 hours were wasted building again and again and discarding the result.)

Nothing specifically, no. I've been very backlogged of late.

We did do a update/reboot cycle yesterday of all hosts, including builders.

Note that "free" there is what would better be called 'wasted', i.e., it's not being used at all. You can see buff/cache is 15 GB, so if more memory is needed the kernel should drop caches to get it.
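As a sketch of how to see reclaimable memory on a builder (this uses the standard /proc/meminfo interface, nothing specific to these hosts):

```shell
# MemAvailable estimates memory usable without swapping; unlike MemFree,
# it counts most of the page cache as reclaimable.
grep -E '^(MemTotal|MemFree|MemAvailable|Cached):' /proc/meminfo

# Caches can also be dropped by hand (root only; the kernel normally
# reclaims them on demand, so this is rarely needed):
#   sync && echo 3 > /proc/sys/vm/drop_caches
```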

Was this always failing on ppc64le? Are the items that @kalev put in for debuginfo generation still being used? (Although I guess this is F37, which might predate those additions.)

Nothing specifically, no. I've been very backlogged of late.

We did do a update/reboot cycle yesterday of all hosts, including builders.

Hm, the build succeeded on Monday, before the update/reboot.

Note that "free" there is what would better be called 'wasted', i.e., it's not being used at all. You can see buff/cache is 15 GB, so if more memory is needed the kernel should drop caches to get it.

Was this always failing on ppc64le?

No, builds for all architectures have been reliable since January, when you fixed #11000. (Looking through my old infrastructure tickets, I think our build infrastructure has been good for WebKitGTK with the exception of basically all of 2022, which seemed like a struggle. My other recentish issue was #10544.)

I also had three other equivalent builds for ppc64le succeed without difficulty a day or two prior to this failed one. However, this particular build appears to have restarted six times, so this was not a transient issue and is likely to reoccur. I suspect not enough RAM; maybe the logic to drop caches did not operate quickly enough? RAM use can go from low to extremely high in a very short amount of time.

Are the items that @kalev put in for debuginfo generation still being used? (Although I guess this is F37, which might predate those additions.)

It was in place but likely ineffective. We have:

# Require 32 GB of RAM per vCPU for debuginfo processing. 16 GB is not enough.
%global _find_debuginfo_opts %limit_build -m 32768

to request 32 GB of RAM per vCPU during debuginfo processing. This is the most memory-intensive part of the build so it's where problems are most likely to occur. But all this can do is limit parallelism to 1 job; if there's not enough RAM for 1 job, it's still going to fail.
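The effect can be sketched like this (illustrative logic only, not the real %limit_build macro; the 8 vCPU figure is an assumption):

```shell
# Cap the parallel job count so each job can claim mem_per_job_mb of RAM.
# With a ~20 GiB builder and -m 32768, the cap works out to a single job.
mem_mb=20480          # builder RAM (approx., per hw_info.log)
ncpus=8               # assumed vCPU count
mem_per_job_mb=32768  # the -m value from the spec

jobs=$(( mem_mb / mem_per_job_mb ))
if [ "$jobs" -lt 1 ]; then jobs=1; fi          # never go below one job
if [ "$jobs" -gt "$ncpus" ]; then jobs="$ncpus"; fi

echo "-j$jobs"        # prints -j1: parallelism is already fully limited
```

This also illustrates the point: once the cap bottoms out at one job, raising -m further changes nothing; if a single job still exceeds physical RAM, the build fails anyway.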

(We also have %cmake_build %limit_build -m 3072 to request 3 GB of RAM per vCPU during compilation, but failures here are less likely. Since koji doesn't offer build logs for the restarted builds, I can't know for sure where it failed.)

There may be something going on with s390x again too.. I see:

✗ koji list-tasks --host buildvm-s390x-25.s390.fedoraproject.org; koji list-tasks --host buildvm-s390x-20.s390.fedoraproject.org
ID Pri Owner State Arch Name
108911307 19 pwalter OPEN s390x buildArch (webkit2gtk4.0-2.42.2-1.fc40.src.rpm, s390x)
ID Pri Owner State Arch Name
108910932 19 pwalter OPEN s390x buildArch (webkit2gtk4.0-2.42.0-1.fc40.src.rpm, s390x)

and those don't appear to be finishing correctly either.

I'll try and come up with some plan... :(

It's hard to say what's going on without logs :( Note that webkitgtk on F40+ should be much easier on the builders because a third of it, webkit2gtk4.0, was split out into its own srpm.

@kevin Was there not some kind of zram issue on builders that kept creeping back in a while back?

As far as I can tell, all of the fixes for debuginfo extraction are still in place. Another piece of the puzzle was dwz 0.15, which started doing more parallelism and needed https://sourceware.org/pipermail/debugedit/2023-January/000173.html, but that shouldn't be relevant here because F37 still has dwz 0.14.

%global _find_debuginfo_opts %limit_build -m 32768

I'll note that the ppc64le builders have only 20 GB of RAM so the previous setting of %limit_build -m 16384 was already enough to get -j1 passed to debuginfo extraction and all parallelism turned off.

So, I see the ones from 6 days ago finished ok? Are we still seeing this?

the scratch build on s390x is still stuck: https://koji.fedoraproject.org/koji/taskinfo?taskID=108910932
and I can't ssh into that builder... it's completely unresponsive. ;(

So, I see the ones from 6 days ago finished ok?

Yes, although that was a rawhide build, which is less resource-intensive since in rawhide WebKitGTK is only built twice, instead of three times as in the stable Fedoras.

Are we still seeing this?

I haven't seen any problems since I reported this, but I also have not been submitting new builds.

the scratch build on s390x is still stuck: https://koji.fedoraproject.org/koji/taskinfo?taskID=108910932
and I can't ssh into that builder... it's completely unresponsive. ;(

Oh nice, I see that build has been running for more than two weeks now. Didn't know about that one. I'd say something is wrong. :)

If ssh is completely unresponsive, I think OOM is the most likely cause.

Probably builds should time out and fail after a couple days, so they don't just continue churning forever like this?

Yeah, there is supposed to be a time limit, but it seems like it's a per-build-attempt limit... so if the build restarts, that timer restarts. ;(
Can file a koji bug on it.
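A minimal sketch of the kind of wall-clock check such a watchdog could do, independent of per-attempt timers (the dates are illustrative, GNU date is assumed; a real check would take the timestamps from the hub's task record):

```shell
# Flag any task whose total wall-clock age exceeds a hard cap, no matter
# how many times it restarted in between.
start=$(date -d '2023-10-25' +%s)   # task creation time (assumed)
now=$(date -d '2023-11-15' +%s)
days=$(( (now - start) / 86400 ))

if [ "$days" -gt 3 ]; then
    echo "task exceeded 3-day wall-clock limit ($days days)"
fi
```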

Is this still happening? I see a number of successful builds...

Same answer from me as before: I haven't seen any problems since I reported this, but I also have not been submitting new builds.

If the configuration has not changed, then it's probably going to happen again.

Also there is the matter of https://koji.fedoraproject.org/koji/taskinfo?taskID=108910932. It just shouldn't be possible for a build to last three weeks. I guess any arbitrary timeout might be too small if there is a sufficiently large queue of other jobs, but maybe three days would be better?

Yeah, what happens there is apparently that the build makes the builder completely unresponsive except that it still responds to ping. It stops checking into the hub, and I cannot log in to it even on the (virtual) console. So I guess the hub just assumes it's ongoing without any information.

/me goes to rebuild 2 of the s390x buildvm's in this state. ;(

So... the package name changed?

We have in koji config:

source */webkitgtk* :: use heavybuilder

this does not match webkit2gtk4.0, so it was getting a 'normal' builder.

What's the glob we should be using here to ensure all the webkit builds go to heavybuilder?
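Assuming the hub policy uses shell-style (fnmatch) globs, the mismatch and one candidate fix can be checked like this (`m` is a throwaway helper, and the broader pattern is only a suggestion, not the actual config):

```shell
# Why "*/webkitgtk*" misses the new srpm name: "webkit2gtk4.0" has a "2"
# between "webkit" and "gtk", so the literal substring never matches.
m() { case "$2" in $1) echo match;; *) echo "no match";; esac; }

m '*/webkitgtk*'   'rpms/webkitgtk'        # match
m '*/webkitgtk*'   'rpms/webkit2gtk4.0'    # no match
# A broader glob such as "*/*webkit*gtk*" would catch both spellings:
m '*/*webkit*gtk*' 'rpms/webkitgtk'        # match
m '*/*webkit*gtk*' 'rpms/webkit2gtk4.0'    # match
```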

It failed (because its buildroot was no longer current). I resubmitted it, but it failed again due to rawhide package changes.

The next time it builds, let me know and I can manually assign it to heavybuilder builders (or @kalev can).

CC: @pwalter

The source package was previously named webkitgtk4, then renamed to webkit2gtk3, then most recently renamed to webkitgtk.

The binary packages all have different names, but surely those shouldn't affect scheduling as they're all built in one job on the same builder.

Actually, I see pwalter introduced the webkit2gtk4.0 package, which I didn't know about. This doesn't seem useful because all packages that depend on it are going to be removed from Fedora 40 as part of https://fedoraproject.org/wiki/Changes/Remove_webkit2gtk-4.0_API_Version. Package maintainers might not notice that their packages are scheduled for removal so long as the dependency is present. I had originally planned to implement the change by just allowing the packages to fail to build and eventually be retired, but that won't work anymore.

(We are removing packages that depend on both WebKitGTK and libsoup 2 as a security measure because libsoup 2 is a security-sensitive HTTP library that is barely maintained and WebKitGTK's network process is unsandboxed. I wanted to remove everything that depends on libsoup 2, but FESCo rejected that change.)

My recommendation now is to retire that package, as that's the simplest way to implement the change proposal.

I'd rather not defer implementation of the change proposal to F41, but we might need to as I'm no longer confident that applications have had sufficient time to prepare for removal of webkit2gtk-4.0. @ngompa your opinion would be helpful.

@kevin Thank you for adding me. I did not know that webkitgtk can use heavybuilder. If you can add webkit2gtk4.0 to the list that would be helpful.

@catanzaro It was agreed in https://pagure.io/fesco/issue/2984#comment-855091 that it is fine to add the compat package. You clearly stated that you do not want to continue maintaining the libsoup 2 webkit port and I picked it up where you left it. Please do not be an ass about it now and force removal of a lot of packages. The packages were only going to be removed if nobody created a compat package as agreed.

@catanzaro Why did you retire https://src.fedoraproject.org/rpms/wpebackend-fdo/c/39f691ba268380a7c22752b8c5b79eafc510dd4e?branch=rawhide and https://src.fedoraproject.org/rpms/libwpe/c/185de91b338b0a0f245041883e63f1e60976a7b0?branch=rawhide and maybe more packages without notifying the devel list? This is not OK. Please ask releng to revert the retirement to avoid breaking webkit2gtk4.0. Feel free to orphan the packages afterwards and/or assign them to me.

Thank you and have a nice day.

It was agreed in https://pagure.io/fesco/issue/2984#comment-855091 that it is fine to add the compat package.

Ah, and it's even me who said that. Well, OK then. Still, forcing removal of abandoned/unmaintained packages was my goal. I emphasize that this library is security-critical; if you keep it around, it will become like QtWebKit eventually (albeit the situation is not nearly that bad yet, since WebKitGTK upstream still supports libsoup 2).

So, let me know what apps you're concerned about. Probably it will be easy to port them to libsoup 3. Hopefully.

Why did you retire https://src.fedoraproject.org/rpms/wpebackend-fdo/c/39f691ba268380a7c22752b8c5b79eafc510dd4e?branch=rawhide and https://src.fedoraproject.org/rpms/libwpe/c/185de91b338b0a0f245041883e63f1e60976a7b0?branch=rawhide and maybe more packages without notifying the devel list? This is not OK. Please ask releng to revert the retirement to avoid breaking webkit2gtk4.0. Feel free to orphan the packages afterwards and or assign them to me.

I retired them because WebKitGTK doesn't need them anymore and there's no point in having unused dependencies in the distro. I never imagined that there was another WebKitGTK I didn't know about. They are being obsoleted in WPE WebKit too. Fortunately, there's no need to bring them back; just update rawhide to the latest WebKitGTK 2.43 unstable version, which should go smoothly, and you'll be good to go.

Fortunately, there's no need to bring them back; just update rawhide to the latest WebKitGTK 2.43 unstable version, which should go smoothly, and you'll be good to go.

Actually, you don't even need to update. I think they are not used anymore as of 2.42; the code just wasn't removed yet. You can safely build with -DUSE_WPE_RENDERER=OFF.
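A hedged sketch of what that could look like in the webkit2gtk4.0 spec (the surrounding %cmake invocation and the other flags are assumed, not taken from the actual spec):

```
# Hypothetical spec fragment: stop using the WPE renderer so libwpe and
# wpebackend-fdo are no longer needed as build dependencies.
%cmake \
    -DUSE_WPE_RENDERER=OFF \
    ...
```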

ok. I added that package to the hub config.

So, I guess we leave this open to track issues on the next webkitgtk build? Or should we close it and you can re-open if you are seeing issues?

Let's close this for now. That said, since we didn't make any configuration changes to the ppc64le builder, I expect we'll run into trouble again...

Metadata Update from @catanzaro:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

9 months ago
