#11822 Numerous Packages in Fedora fail `git-fsck`
Opened 2 years ago by sgallagh. Modified 2 years ago

Describe the issue

Many packages (mostly Java) in Fedora dist-git fail to validate with git-fsck. This prevents importing those packages' history into forges that run this check on push for security reasons. In particular, this is preventing us from importing those packages into CentOS Stream 10, which uses GitLab as the host for its dist-git.

The packages specifically needed for CentOS Stream 10 are:

antlr
apache-commons-cli
apache-commons-codec
apache-commons-exec
apache-commons-io
apache-commons-parent
beust-jcommander
bsf
jdom
jsch
log4j
maven-antrun-plugin
maven-archiver
maven-assembly-plugin
maven-shade-plugin
mojo-parent

The rest of the list is attached below. All of the affected packages have the same root issue: a packager many years ago had an extra < character in their author/committer field, which causes a (harmless) validation error.
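As a rough illustration of the root cause, the sketch below checks whether an author/committer identity line has the shape `Name <email> timestamp timezone`. It is a simplified approximation, not git-fsck's actual logic; the regular expression, the function name `ident_is_well_formed`, and the packager name and timestamp are assumptions made up for the example.

```python
import re

# Simplified shape of an identity line: "Name <email> timestamp timezone".
# Assumption: neither the name nor the email may contain '<' or '>'.
IDENT_RE = re.compile(rb"^[^<>\n]+ <[^<>\n]*> \d+ [+-]\d{4}$")


def ident_is_well_formed(ident: bytes) -> bool:
    """Return True if the identity line matches the simplified shape."""
    return IDENT_RE.match(ident) is not None


# Hypothetical identity lines; only the email address comes from this ticket.
good = b"Example Packager <akurtako@redhat.com> 1200000000 +0000"
bad = b"Example Packager < <akurtako@redhat.com> 1200000000 +0000"

print(ident_is_well_formed(good))
print(ident_is_well_formed(bad))
```

The second line carries the stray `<` before the real email and is rejected by this simplified check.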

What we propose:

  1. Archive the current head of the rawhide branch with a new branch/tag identifier so that it will remain in the repository.
  2. Rewrite history from the affected commit forwards, correcting the author/committer field so that the history on the rawhide branch will validate successfully. This means that the HEAD of rawhide will move, and we will need to announce that anyone with an existing checkout must re-pull from it.
  3. Import the history from the fixed Rawhide branch into CentOS Stream 10.
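Steps 1 and 2 could be sketched as the following command builder. It only constructs the command lines and does not run them. The archive branch name `rawhide-pre-rewrite` and the function name `plan_rewrite` are hypothetical. `--refs` is used on the assumption that it limits git filter-repo's rewrite to the named branch, so that the archived copy keeps the original commits; this should be verified on a scratch clone first. The email callback is the one from the filter-repo command quoted in this ticket.

```python
from typing import List

# Callback body taken from the filter-repo command quoted in this ticket.
EMAIL_CALLBACK = (
    'return email.replace(b" <akurtako@redhat.com", b"akurtako@redhat.com")'
)


def plan_rewrite(branch: str = "rawhide",
                 archive: str = "rawhide-pre-rewrite") -> List[List[str]]:
    """Build (but do not run) the commands for steps 1 and 2."""
    return [
        # Step 1: keep the unmodified history reachable under a new name.
        ["git", "branch", archive, branch],
        # Step 2: rewrite only the chosen branch, fixing the email field.
        ["git", "filter-repo", "--force",
         "--refs", branch,
         "--email-callback", EMAIL_CALLBACK],
    ]


for cmd in plan_rewrite():
    print(cmd)
```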

This will resolve the issue for these sixteen packages, but it's also worth noting that Fedora will still need to deal with this issue on other branches at some point if we move away from Pagure. We probably also want to bring Pagure in line with GitLab here and make it run this check on pushes to ensure we don't reintroduce similar issues.

When do you need this? (YYYY/MM/DD)

Immediately

When is this no longer needed or useful? (YYYY/MM/DD)

This should be a one-time event.

If we cannot complete your request, what is the impact?

We will be unable to import the affected packages directly to CentOS Stream 10 and will need to find another workaround, probably making trivial merges/cherry-picks from Fedora unusable.

Appendix

The complete list of Fedora packages that fail git-fsck as of 2023-12-04.

ant-contrib
antlr
apache-commons-cli
apache-commons-codec
apache-commons-dbutils
apache-commons-digester
apache-commons-discovery
apache-commons-exec
apache-commons-io
apache-commons-parent
apache-commons-pool
apache-mime4j
beust-jcommander
bsf
cal10n
castor
checkstyle
cssparser
decentxml
derby
eclipse
eclipse-cmakeed
eclipse-egit
eclipse-jgit
eclipse-manpage
eclipse-pydev
eclipse-rpm-editor
eclipse-rpmstubby
eclipse-shelled
felix-osgi-foundation
geronimo-jaxrpc
geronimo-osgi-support
geronimo-saaj
gstreamer-java
httpunit
icu4j
javamail
jaxen
jboss-parent
jdom
jline
joda-time
jsch
kxml
log4j
maven-ant-plugin
maven-antrun-plugin
maven-archiver
maven-assembly-plugin
maven-checkstyle-plugin
maven-doxia-tools
maven-eclipse-plugin
maven-help-plugin
maven-idea-plugin
maven-javadoc-plugin
maven-plugin-exec
maven-pmd-plugin
maven-release
maven-repository-plugin
maven-shade-plugin
maven-skins
mojo-parent
msv
mx4j
nekohtml
plexus-active-collections
plexus-interactivity
rpmorphan
sqljet
xdoclet
xmltool

We should fix this, if for no other reason than that Pagure's remote pull request feature is effectively useless on these packages: you can't push them to another git server, because the push fails git-fsck.

I would want this to get FESCo approval and also get a lot of noise in the community before we do it... since all those repos checked out by maintainers would be useless/broken after we re-write history. ;(

Metadata Update from @phsmoura:
- Issue tagged with: high-gain, high-trouble, ops

2 years ago

OK, with the time it would take to get approved by FESCo and then communicate it effectively to the community as a whole, I think we (CentOS Stream 10) need to proceed with our backup plan of just breaking the inheritance and importing these with revised history. I'll leave this ticket open because we do still need to come up with a longer-term solution.

Well, we still have a month, so a FESCo ticket about this now (to be discussed/voted on by Thursday) and then an announcement shortly after can give us the time we need.

Please be diligent here.
"Fedora rewrites dist-git history to make RHEL's choice of non-opensource gitlab.com work" would be a very bad headline for all of us. And it's a predictable headline on you_name_it.
The issue of fsck-protected pushes is too technical to make it even in the byline on those sites ...

There are good reasons for both the non-rewrite policy as well as the fsck-protection. They are in conflict here. If we were in control of both forges the obvious short-term solution would be to check the fsck warning carefully and then override it. We cannot do that because gitlab.com does not let us do that. And it's really disappointing, given git has fsck.skipList specifically for that purpose, with fine-grained control.
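For a forge whose configuration is under our control, the override mentioned above could look roughly like this. The sketch only formats a skip-list file: one full object ID per line, with `#` comment lines, which is the format git documents for `fsck.skipList`. The object ID and the function name `format_skiplist` are placeholders, not real commits from the affected packages.

```python
import re
from typing import Iterable

# Full SHA-1 (40 hex digits) or SHA-256 (64 hex digits) object names.
OID_RE = re.compile(r"[0-9a-f]{40}|[0-9a-f]{64}")


def format_skiplist(oids: Iterable[str], reason: str) -> str:
    """Return skip-list file content: a comment line, then sorted object IDs."""
    unique = sorted(set(oids))
    for oid in unique:
        if not OID_RE.fullmatch(oid):
            raise ValueError(f"not a full object ID: {oid!r}")
    lines = [f"# {reason}"] + unique
    return "\n".join(lines) + "\n"


# Placeholder object ID, not a real commit.
print(format_skiplist(["0" * 40], "known-harmless malformed author email"), end="")
```

The resulting file would then be referenced through `git config fsck.skipList <path>`, or `receive.fsck.skipList` on the side that accepts pushes.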

I'm wondering, though - where does CentOS Stream 9 have its package sources, and why hasn't the problem surfaced there?

Does gitlab.com offer shallow clones? This would allow us to get the history from 2012 onwards (I checked bsf only), which should suffice for now: not rewritten, and fsck'able!

.mailmap does not help fsck, btw (only log). Neither does git replace.

> I'm wondering, though - where does CentOS Stream 9 have its package sources, and why hasn't the problem surfaced there?

CentOS Stream 9 also has it on gitlab.com, but it doesn't surface there because they didn't import Git history when they forked from Fedora Linux 34. They dropped all the history of the package and created new repos. However, this messes with attribution, especially for packages that use RPMAutoSpec. Back then, RPMAutoSpec was new and not broadly adopted, but now it is used in many core packages. Thus, the community pushed for preserving the Git history for packages when forking from Fedora the next time. That time is now.

For the record, I've imported the sixteen CentOS Stream 10 branches by running the following magic incantation over them:

git filter-repo --force --email-callback 'return email.replace(b" <akurtako@redhat.com", b"akurtako@redhat.com")'

All of the issues were the result of an extra pair of characters in the email field for a short period around 2008 (likely in CVS and pulled over when we switched to git): < <akurtako@redhat.com> instead of <akurtako@redhat.com>
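To make the effect of that callback concrete, here is the same replacement as a standalone function. git filter-repo hands the callback the email as a bytes value, which is why the literals carry a `b` prefix. The function name `fix_email` and the second sample address are made up for illustration, and the first input assumes the malformed field is parsed as the email ` <akurtako@redhat.com` (with the leading space and stray `<`).

```python
def fix_email(email: bytes) -> bytes:
    """Same expression as the --email-callback body in the command above."""
    return email.replace(b" <akurtako@redhat.com", b"akurtako@redhat.com")


# Malformed value, assumed to be what is parsed out of "< <akurtako@redhat.com>".
print(fix_email(b" <akurtako@redhat.com"))
# Other addresses pass through unchanged (hypothetical example address).
print(fix_email(b"someone@example.org"))
```

Because `bytes.replace` returns the input unchanged when the pattern is absent, every other author and committer email is left as it was.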

If FESCo approves the fixup, I can pull these 16 branches (and their updated history) back over to Fedora easily. I went ahead with the CentOS Stream 10 import right away because they are blocking our switch over to using CS10 as its own buildroot (rather than relying on ELN for the buildroot).

For reference, the corresponding FESCo ticket is https://pagure.io/fesco/issue/3119

Metadata
Boards 1
Ops Status: Backlog