#3291 Advise around a ROCm llvm fork
Closed: Accepted 3 months ago by kevin. Opened 5 months ago by mystro256.

Hello,

I am one of the maintainers of rocm-compilersupport and I have a question around the future maintainability of this package.

Upstream ROCm uses a fork of LLVM 18 and we unbundled it with the help of the Fedora LLVM maintainer team, who are maintaining a compatibility package for N-1 (currently llvm18 in rawhide). This seemed fine given upstream ROCm's tendency to only lag behind 1 major release of LLVM, but it seems upstream ROCm is not committing to updating to LLVM 19 by the time that LLVM 20 will become available in Fedora.

Given upstream ROCm's lack of commitment on when the upgrade to LLVM 19 will happen, and the complexity of upgrading current ROCm packages to be compatible with LLVM 19, we might need to continue to use LLVM 18 for some time.

I spoke to the LLVM maintainers, and it seems they want to drop llvm18 since upstream LLVM is no longer maintaining the llvm 18 release branch or will be dropping soon. So my thought process is that we should just build the rocm fork of llvm instead of taking over llvm18 since the former is still maintained. I spoke to the other maintainers and we wish to continue to build this in rawhide until the LLVM situation improves with ROCm, which could span multiple Fedora releases in the future.

This also has other benefits, as ROCm's LLVM fork has fixes and features earlier than upstream LLVM. We also have a few bugs caused by not static linking libclang, which the Fedora maintainers do not wish to maintain, but the ROCm sig is fine maintaining (dynamic linking has symbol clashes with the latest Fedora clang package). Finally, landing the planned ROCm 6.2 into Fedora 41 was also very rocky since llvm18 landed during beta freeze and with bugs that prevented us from fixing ROCm in time for the F41 release.

I know that bundling is now a "should not" not a "must not" anymore, so I assume we don't need a FESCo sign off, but any thoughts from FESCo would be welcome before we proceed.

Thank you


@trix has some work in progress patches. He tried to reduce the scope of what llvm libs we are building/packaging as much as possible to avoid unintended use outside of ROCm related usecases.

See the spec file:
https://src.fedoraproject.org/rpms/rocm-compilersupport/blob/rawhide/f/rocm-compilersupport.spec
and there's several "with bundled_llvm" conditions in there.

I still need to go in and clean up/provide feedback. I believe he has a copr available, which I can ask him for if anyone is interested in looking further into it.

The bundled_llvm is off for fedora, for other disto, it is on. Rawhide is tracked with this COPR https://copr.fedorainfracloud.org/coprs/g/rocm-packagers-sig/RH/

For CentOS stream bundled_llvm is on https://copr.fedorainfracloud.org/coprs/g/rocm-packagers-sig/CS10/

For Suse Tumble Weed bundled_llvm is on
https://copr.fedorainfracloud.org/coprs/g/rocm-packagers-sig/TW/

You have my blessing ;) More seriously, it sounds like you have a handle on things and have considered the pros and cons of both approaches. FESCo doesn't have the detailed knowledge that would be useful here. This needs to be discussed and decided between the maintainers of ROCm and LLVM. As you wrote, anti-bundling rules are advisory, so if the maintainers of the package think that bundling is more maintainable, that is the way to go.

+1 to what zbyszek says. We don't have detailed knowledge, but from what is presented, bundling (for now) seems like it would reduce maintenance effort /and/ provide a less buggy experience. Hopefully upstream would eventually track upstream LLVM more closely and drop their fork (though looking at the competing CUDA that might be a long time away)

I think there may be some confusion here. We don't drop any of the compat packages (like llvm18) if they still have other packages that depend on them. We do orphan them, but anyone can pick up ownership if they want. For example Fedora still carries the llvm11 package which is about 5 years old.

In my opinion, a proposal like this should be an official Fedora change proposal and needs to include more specifics about which libraries and tools are being bundled.

We also have a few bugs caused by not static linking libclang, which the Fedora maintainers do not wish to maintain, but the ROCm sig is fine maintaining (dynamic linking has symbol clashes with the latest Fedora clang package).

If these bugs are not fixed by this update, please provide feedback and steps to reproduce.

Finally, landing the planned ROCm 6.2 into Fedora 41 was also very rocky since llvm18 landed during beta freeze and with bugs that prevented us from fixing ROCm in time for the F41 release.

I think we, ROCm and LLVM maintainers, can do a better job working together to ensuring a more smooth release. There is a very detailed release plan with milestone dates for LLVM in Fedora, and packages are available in COPR that can be used for testing before they land in Fedora proper. It should be possible to test things out early or even provide CI tests that we can run with our updates.

Overall, I am opposed to this proposal. I think all the issues mentioned can be resolved by having better collaboration between the two teams.

@tstellar I think there's two points I would like to touch on:

First, I'm not sure I understand orphan vs abandon here. I mean if we're going to maintain llvm18, wouldn't it make more sense to use an upstream that isn't inactive? Or is upstream LLVM still going to maintain release/18.x branch for critical fixes?

Second, can we propose using a model more similar to SUSE, where packages like "llvm" and "clang" are just metapackages pointing to the latest (i.e. llvm19 and clang19)? I think some of the issues in the past release came from the late arrival of llvm18, the gap between llvm upgrading to 19 and llvm18 landing in f41, as well as some bugs introduced in the repackaging of llvm18.

If llvm18 was available from the start, where llvm was a metapackage requiring llvm18 (same for clang, compiler-rt, lld), we wouldn't have had to scramble to change our packages to accommodate the transition.

To answer your other question, I believe ROCm uses:
llvm, clang, compiler-rt, lld, openmp

The ROCm LLVM tree is a fork of LLVM's 18's tag llvmorg-18-init (branch point for LLVM 18 release branch), with fixes and backports from main. The llvm18 package not a clean replacement, as there is some variance between llvm18's and rocm's llvm's API, so I do maintain a few patches for compatibility. I've been wanting to work closer with upstream to avoid these API differences.

Furthermore, there is some new HW lag by using an old LLVM (as opposed to the ROCm fork), which is unavoidable.

LLVM release very quickly, every Fedora release, and unfortunately these land in the release after branch and freeze point. For a while ROCm's llvm was able to keep up with at least then N-1 release. But with the current 6.2 and next 6.3 ROCm releases, ROCm has slipped to N-2. Once slipped, because of the speed of the LLVM releases, I do not believe ROCm will ever be at N-1 again.

The ownership of the to-be orphaned llvm18 was discussed in the ai/mi meeting https://board.net/p/fedora-aiml-sig-meeting-agenda 11/21. No one was willing to pick up these packages. Having slipped, ROCm will need llvm19 in F43, llvm20 in F44 etc. at the time of these release these will also be orphaned and will need someone to pick them up.

rocm-llvm is not replacing llvm compat. It is being used as it is in the upstream, for build ROCm packages and supporting user hip and rocm programs. Though it does follow the best practices of how llvm18 is created, it's install location is /usr/lib64/rocm/llvm. The user interaction is through the existing wrapper /usr/bin/hipcc. The existing libcomgr, that applications like blender use is unchanged.

When llvm18 is orphaned, late in the F42 release, the ROCm feature and it's users like bender, pytorch and ollama will be broken. llvm-rocm has been in testing for the last month with the current ROCm 6.2 The switch over is planned when ROCm 6.3 is released, likely this month, so it will have the maximum amount of testing before the F42 branching.

My concern is mostly that llvm18 is effectively a dead upstream. I haven't seen an updated tag/release for 6 months, and I don't expect to see one in the lifetime of Fedora 42.

We could pick up llvm18, but I can't honestly see myself doing anything other than keep it building and possibly trying to backport fixes from ROCm LLVM. Using a non-dead upstream like ROCm LLVM, seems more appealing to me. If someone else was maintaining llvm18, I would be less hesitant, as it shows others are still using it and fixing it.

In regards to trix's comment, it's unclear to me if ROCm will be skipping 19 in favour for 20, or if we're going to see further lag with LLVM releases.

My concern on picking up llvm18 is supporting not just rocm, but the general use of llvm and clang. A dead upstream project will not be fixed upstream, so to fix for example a c++ issue, would need to be handled in Fedora by someone familiar with c++ frontend. So it would imo be better to let llvm18 orphan than to claim it is supported and to do a poor job.

When ROCm does catch up to our currently supported LLVM, do you intend to unbundle again? Or will you bundle everywhere regardless? I ask especially because ROCm has been getting rebases on stable Fedora branches as well, where LLVM 18 is still supported by our maintainers.

@jistone I think the problem is that jumping between bundled and unbundled llvm is difficult for maintainership. There's also the issue that ROCm's LLVM has backports from LLVM 20 right now, so even if ROCm jumps to LLVM 19 through the f42 development cycle, we can't necessarily unbundle without feature loss or unforeseen regression.

It kind of puts us between a rock and a hard place.

Bundling until the situation improves is sort of the "easy path", but even "improves" is subjective.

Ideally ROCm upstreaming everything to LLVM proper would be best, but it doesn't seem like this has a lot of traction right now with the developers. From what I understand the ROCm specific out of tree LLVM components are not up to the llvm code standards from a style and architectural/design point of view.

Building in Fedora as upstream ROCm provides/intends it (i.e. with bundled llvm) allows us to share issue reports more readily with ROCm upstream, but it right now we seem to have a broken pipe where issues can get stuck in the triage if it appears to be compiler related, and the experience isn't the same as using the AMD provided packages (such as the ones for RHEL).

The alternative to bundled llvm would be a dead upstream, and as trix put it, possible expections from users who want to use it for other reasons.

I think I'll reach out to ROCm upstream for their thoughts, but feel free to chime in if you have anything to add.

From a practical standpoint, it seems to me that ROCm's fork of LLVM is:

  • Functionally maintained as a separate project from LLVM upstream
  • Useful ONLY to ROCm
  • Will be maintained as long as ROCm needs it

With that in mind, I'm +1 to granting them a bundling exception, since I think we really DON'T want anyone else to use this fork in Fedora.

And yes, in an ideal world ROCm upstream would be using only unmodified LLVM. I hope they get there some day.

With that in mind, I'm +1 to granting them a bundling exception,

This isn't even needed since the bundling guidelines were relaxed a few years ago.

But sure, in this case, +1 ...

From a practical standpoint, it seems to me that ROCm's fork of LLVM is:

  • Functionally maintained as a separate project from LLVM upstream
  • Useful ONLY to ROCm
  • Will be maintained as long as ROCm needs it

In my opinion, these are valid reasons to bundle, however, this needs to be articulated better in the proposal. Most of the justifications given for bundling seem to be related to the perceived low-quality of the existing llvm18 package and its lifecycle.

I think it would be helpful to add to the proposal:

  • More information about what is different between ROCm's LLVM fork and the upstream LLVM (i.e. what features, bugfixes etc. that ROCm's fork has that upstream LLVM does not)
  • A detailed description of which LLVM components will be bundled and where on the filesystem they will live.
  • A test plan to ensure that the bundled llvm won't conflict with the other llvm packages on the system (so we can avoid issues like this https://bugzilla.redhat.com/show_bug.cgi?id=2324076)

With that in mind, I'm +1 to granting them a bundling exception,

This isn't even needed since the bundling guidelines were relaxed a few years ago.

But sure, in this case, +1 ...

Sorry, I meant to say "I'm +1 to recommending the use of bundling for this case".

From a practical standpoint, it seems to me that ROCm's fork of LLVM is:

  • Functionally maintained as a separate project from LLVM upstream
  • Useful ONLY to ROCm
  • Will be maintained as long as ROCm needs it

In my opinion, these are valid reasons to bundle, however, this needs to be articulated better in the proposal. Most of the justifications given for bundling seem to be related to the perceived low-quality of the existing llvm18 package and its lifecycle.

The quality of llvm18 is fine now because it is backed up by experienced folks both in the mainline upstream and in fedora But when it is orphaned it will not be backed up by either. rocm-llvm will be fine for rocm because it will be backed up by AMD's experienced ROCm compiler folks and in fedora. For me this is about managing risk and uncertainty and to take responsibility for the parts that I can and to not over promise on things I can not. I will take responsibility for llvm-rocm on fedora.

I think it would be helpful to add to the proposal:

I have the omp package in development now, it will benefit HPC users
rocMLIR will follow, some work as been done by conditionally enabling the mlir target in rocm-llvm. It will benifit some AI use cases.

The 'else' side of this conditional can be seen in the Centos stream 10 and Suse Tumbleweed COPR's I referenced above.

The prefix for the bundled install is in an existing ROCm owned, non system location here
https://src.fedoraproject.org/rpms/rocm-compilersupport/blob/rawhide/f/rocm-compilersupport.spec#_35

It is /usr/lib64/rocm/llvm/{lib,bin}

There is nothing like alternatives magic to make this easy to find and use. The user will either use the wrapper /usr/bin/hipcc or manually set up to a path that is obviously not the system compiler or the compat system compiler.

I don't see any provides/requires filtering there... do you have an example build I can look at?

You probably should have __provides/requires_exclude_from on that rocm private path:
https://docs.fedoraproject.org/en-US/packaging-guidelines/AutoProvidesAndRequiresFiltering/

There are the Centos Stream https://copr.fedorainfracloud.org/coprs/g/rocm-packagers-sig/CS10/
And Tumbleweed https://copr.fedorainfracloud.org/coprs/g/rocm-packagers-sig/TW/
These are both derivatives of Rawhide.

I'll look at adding those filters.

Thanks! It looks like the libraries have "18git" in their sonames, so they won't (currently) conflict with our normal packages, but they should still be filtered from rpm Requires/Provides.

I think your Provides: bundled(llvm-project) isn't working either, because you're not actually shipping a top-level rocm-compilersupport package. Instead, I think that should be repeated on each of the relevant subpackages.

I see the filtering was added in https://src.fedoraproject.org/rpms/rocm-compilersupport/c/babecbfaab4af45ff8ccb64487533cc51bb75837 and removed again in https://src.fedoraproject.org/rpms/rocm-compilersupport/c/ecdb8550453a564772059422bf611811c0d2f433. Anyhow, I assume that the Provides are under control.

I'm +1 to recommending the use of bundling for this case

+1 from me too. That makes it +3.

I'd like to close this somehow. Other FESco members, please vote on the recommendation.

I'm +1 to recommending the use of bundling for this case

+1 from me too. That makes it +3.

I'd like to close this somehow. Other FESco members, please vote on the recommendation.

Technically, according to the FESCo voting policy, this has passed two weeks with at least one +1 and no -1s, so it's approved.

I think that https://pagure.io/fesco/issue/3291#comment-947429 is not clear enough to be treated as an official proposal submitted for a vote and that we should wait a week from today.

I think that https://pagure.io/fesco/issue/3291#comment-947429 is not clear enough to be treated as an official proposal submitted for a vote and that we should wait a week from today.

Does this mean there will be an official change proposal for this change?

Oh, I meant to reply to that part, but then forgot.

You wrote:

In my opinion, a proposal like this should be an official Fedora change proposal and needs to include more specifics about which libraries and tools are being bundled.

I disagree. The Change process is for user visible features or things that other maintainers need to know about. Whether a package is built using this or that compiler doesn't meet the criteria. In fact, all people who need to know about this are probably subscribed to this ticket. Creating a Change page is a lot of extra work and I don't think it's justified to require it, not would such a page be even useful.

Yeah The intention for this was not a change proposal, more for advise, since A) it should be transparent to the user and other maintainers, and B) does not require any exception anymore. We can close it if no other comments are needed.
I need to sync with @trix about the details but we're going ahead with it. I believe we've trimmed out any provides/requires, but feel free to me let me know if there's issues. See:
https://src.fedoraproject.org/rpms/rocm-compilersupport/c/babecbfaab4af45ff8ccb64487533cc51bb75837?branch=rawhide

Also I missed @jistone 's comment, do we need to have the bundled provides on the binary packages? I thought it was just to be added to the spec file for tracking on the source level. I guess that does make sense to add it to one of the binary package. Anyway it's an easy fix, so just let me know.

It needs to exist in the binary packages, yes.

I disagree. The Change process is for user visible features or things that other maintainers need to know about. Whether a package is built using this or that compiler doesn't meet the criteria. In fact, all people who need to know about this are probably subscribed to this ticket. Creating a Change page is a lot of extra work and I don't think it's justified to require it, not would such a page be even useful.

OK, we can disagree on this. I think having a write up would be useful to help clarify exactly what is being changed. For example, this proposal isn't about changing the compiler used to build the packages, it's about changing the libraries that the ROCm packages link against at runtime.

Reminder: voting concludes in two days and there have been no additional votes since the request to grant an extra week. It's currently at (+3, 0, -0).

+1 for the bundling exception

So, thats +4 then... so this is approved.

Make sure you comment heavily any specs.

Metadata Update from @kevin:
- Issue close_status updated to: Accepted
- Issue status updated to: Closed (was: Open)

3 months ago

Log in to comment on this ticket.

Metadata