As part of Fedora ELN work to prepare for CentOS Stream 10 branching from Fedora Linux 40, @sgallagh discovered that there are a number of package git repositories that fail git-fsck. The list of packages is detailed in releng#11822 (to sum up: mostly Java packages that have existed since the beginning of Dist-Git), and the consequences of this are pretty severe:
The ELN SIG has asked Release Engineering for a course of action to fix this, but Release Engineering would like FESCo approval for this as well as a broad announcement that this is happening and why.
I'm marking this for fast-track given the urgency and timeframe needed to resolve this for the SIG.
Proposal: FESCo approves this one-time effort to fix these packages provided there's an announcement from the ELN SIG to the community to inform everyone of what's going on, why, the impact, and how packagers should respond to this.
Metadata Update from @ngompa: - Issue assigned to sgallagh
Just for the record, because it's not mentioned explicitly in the proposal above: This is about rewriting a decade of git history (including git pre-history imported from cvs) for the affected repositories, changing most commit hashes, etc.
However, since part of the proposal is to archive the current rawhide HEAD to ensure mapping koji builds etc. to commit hashes is still possible, I am +1 for this one-time effort, since broken git history is a problem in general.
So, a clarifying statement: If we archive the current rawhide HEAD, we need to do it "somewhere else". As long as those faulty commits exist in Fedora's dist-git (even on a different branch), it becomes impossible to fork the repo or migrate it to a new hosting site.
For the record, I've already taken steps to import fixed branches of those sixteen CentOS Stream 10 packages in the releng ticket into Gitlab as the c10s branch over there. I can import those branches back to Fedora fairly easily once we decide if and how we will archive the existing Rawhide branches somewhere. I didn't touch the ones that were Fedora-only (yet).
Metadata Update from @zbyszek: - Issue tagged with: meeting
Consider me a +1 to the proposal, but my one question is where is "somewhere else" and who is the steward of that archived data? I'd like to see that documented or captured somewhere should we need to dig in to the archived data later.
That's a good question. My proposal: export the .git directory from before the rewrite and save it as a file in the git history of the repo after the rewrite:
    fedpkg clone xmltool
    cd xmltool
    tar -Jcvf xmltool.git.tar.xz .git
    git filter-repo ...
    git add xmltool.git.tar.xz
    git commit -m 'History rewrite: save previous .git directory [skip changelog]'
This way it will never get lost, we don't need a new "place" to store things, and anyone can trivially dig into the history if they need to.
I'd much prefer to create a new rpms-archive distgit namespace and push the old repos there, but if you insist, it would make more sense to put that archive on a separate git checkout --orphan archive branch to avoid confusion or the accidental deletion of the archive.
Metadata Update from @zbyszek: - Issue untagged with: fast track
This was discussed in today's meeting, but we didn't reach any conclusions.
I'd much prefer to create a new rpms-archive distgit namespace and push the old repos there
That is certainly a possibility. But I think that'd be overkill. We're unlikely to ever need to look at those repos. The rewrite is a rather trivial adjustment of the email address.
it would make more sense to put that archive on a separate git checkout --orphan archive branch to avoid confusion or the accidental deletion of the archive.
OK, I like that. (The archive cannot be deleted, because it's attached to a git commit, but yeah, it seems nicer.)
Updated proposal:
    fedpkg clone xmltool
    cd xmltool
    tar -Jcvf xmltool.git.tar.xz .git
    git filter-repo ...
    git checkout --orphan archive
    git add xmltool.git.tar.xz
    git commit -m "Save previous .git directory before rewrite on $(date +%F)"
    git switch -
@sgallagh It is possible to have heads that are not cloned automatically. I expect Gitlab to clone refs/heads/* only, so you could have a branch refs/archive/rawhide with the old contents. This would also exclude it from manual git clone, which could be a good thing or a bad thing depending on how you look at it.
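To illustrate the point about refs that are not cloned automatically, here is a minimal sketch. It assumes the clone uses git's default fetch refspec, whose source side is refs/heads/*; the ref names in the example list and the helper name fetched_by_default are hypothetical, and tags (which git handles separately) are ignored.

```python
# Sketch only: models the source side of git's default fetch refspec,
# "+refs/heads/*:refs/remotes/origin/*". Tags are not modelled here.
DEFAULT_SRC_PATTERN = "refs/heads/*"


def fetched_by_default(ref: str, pattern: str = DEFAULT_SRC_PATTERN) -> bool:
    """Return True if `ref` matches a refspec source pattern ending in '*'."""
    prefix = pattern[:-1]  # drop the trailing "*"
    return ref.startswith(prefix)


# Hypothetical refs on the server after the rewrite.
server_refs = [
    "refs/heads/rawhide",    # rewritten branch
    "refs/heads/f37",        # rewritten branch
    "refs/archive/rawhide",  # old history, kept outside refs/heads/
    "refs/archive/f37",
]

cloned = [ref for ref in server_refs if fetched_by_default(ref)]
skipped = [ref for ref in server_refs if not fetched_by_default(ref)]
```

Under that assumption only the refs/heads/ entries end up in `cloned`, and anything under refs/archive/ would have to be fetched explicitly.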
Please don't put git in tarballs in git.
Please put it where we always put such things, i.e. in archive/; see https://pagure.io/releng/issue/7265
That is indeed nicer. The only disadvantage is that git fsck would fail on a system which has the archive branch. But if we don't have plans to run fsck there, that doesn't matter.
So reading back through this and thinking about the issue ("All of the affected packages have the same root issue: a packager many years ago had an extra < character in their author/committer field, which causes a (harmless) validation error."), I am not convinced this is something that requires the heavy hammer of tarring up the history and archiving it and moving forward. Git has ways to correct errors in author and committer fields. I've done this before, especially when a contributor wants to change their email address. The actual commit IDs remain in place so that existing clones and forks work.
I would strongly prefer us exploring 'git commit --amend' and 'git-filter-repo' to correct the known issues rather than archiving history.
That is how we would fix it, but all the commit hashes will change in the branch where we do this, so we need the old commits archived somewhere in case they need to be pulled.
Right, the committer field is part of what is hashed to create the commit ID. You can absolutely amend or filter-repo to fix things up, but it DOES rewrite the history from that point. In order to retain the original commits, we need to store them somewhere.
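A small sketch of why that is, assuming a SHA-1 repository: a commit ID is the SHA-1 of a "commit <size>\0" header followed by the raw commit body, and that body contains the author and committer lines as well as the parent's ID. The identity strings, timestamps, and helper names below are made up for illustration and are not the actual contents of the affected commits.

```python
import hashlib
from typing import Optional

# Well-known ID of git's empty tree, used so the example needs no real files.
EMPTY_TREE = b"4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def git_commit_id(body: bytes) -> str:
    """SHA-1 object ID git would assign to a commit with this raw body."""
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def make_commit(parent: Optional[str], email: bytes, message: bytes) -> bytes:
    """Build a raw commit body; name and timestamp are hypothetical."""
    ident = b"Example Packager <" + email + b"> 1200000000 +0000"
    lines = [b"tree " + EMPTY_TREE]
    if parent is not None:
        lines.append(b"parent " + parent.encode())
    lines.append(b"author " + ident)
    lines.append(b"committer " + ident)
    return b"\n".join(lines) + b"\n\n" + message + b"\n"


# Original history: a root commit with a malformed email, plus one child.
old_root = git_commit_id(make_commit(None, b" <packager@example.com", b"import"))
old_child = git_commit_id(make_commit(old_root, b"other@example.com", b"update"))

# Rewritten history: only the root's email is fixed, yet the child changes
# too, because its "parent" line now names a different commit ID.
new_root = git_commit_id(make_commit(None, b"packager@example.com", b"import"))
new_child = git_commit_id(make_commit(new_root, b"other@example.com", b"update"))
```

The child commit's own author and committer are untouched, but its ID still differs between the two histories, which is the "rewrites the history from that point" effect.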
FYI, the exact command I ran to fix this up for CentOS Stream 10 was:
git filter-repo --force --email-callback 'return email.replace(b" <akurtako@redhat.com", b"akurtako@redhat.com")'
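For anyone who wants to check what that callback does before running it, here is a rough sketch. It assumes git filter-repo wraps the given text as the body of a Python function taking a bytes argument named email; the wrapper name email_callback and the second sample address are illustrative assumptions, while the function body is copied from the command above.

```python
def email_callback(email: bytes) -> bytes:
    # Body copied verbatim from the --email-callback argument.
    return email.replace(b" <akurtako@redhat.com", b"akurtako@redhat.com")


# The malformed value targeted by the callback (leading space and extra "<").
malformed = b" <akurtako@redhat.com"
fixed = email_callback(malformed)

# A well-formed address (hypothetical) passes through unchanged.
untouched = email_callback(b"someone@example.com")
```

Because bytes.replace only acts on an exact match, addresses that do not contain the malformed string are left as they are.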
There isn't a high urgency on this at the moment, so I'm taking it off the meeting agenda.
Metadata Update from @sgallagh: - Issue untagged with: meeting - Issue tagged with: stalled
I tested the solution proposed by @fweimer and it works as advertised.
Test:
    fedpkg clone xmltool && cd xmltool
    cp -av .git/refs/heads archive  # note that 'archive' must be *outside* .git/ so it doesn't get rewritten
    git filter-repo --force --email-callback 'return email.replace(b" <akurtako@redhat.com", b"akurtako@redhat.com")'
    mv archive .git/refs/
After that, when the repo is cloned, we don't get the old branches. I checked that the original branches cannot be pushed to gitlab, but the new ones can. The old branches can be referred to via git rev-parse archive/f37 and similar.
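A sketch of why the short name archive/f37 resolves, based on the ref lookup order described in the gitrevisions documentation (the name itself under $GIT_DIR, then refs/, refs/tags/, refs/heads/, refs/remotes/, and refs/remotes/<name>/HEAD). The set of existing refs and the helper name resolve are hypothetical, and real git also handles details not modelled here, such as ambiguity warnings.

```python
from typing import Optional, Set

# Lookup order for a short ref name, following gitrevisions(7).
LOOKUP_RULES = [
    "{}",
    "refs/{}",
    "refs/tags/{}",
    "refs/heads/{}",
    "refs/remotes/{}",
    "refs/remotes/{}/HEAD",
]


def resolve(name: str, existing_refs: Set[str]) -> Optional[str]:
    """Return the first full ref name that exists for a short name."""
    for rule in LOOKUP_RULES:
        candidate = rule.format(name)
        if candidate in existing_refs:
            return candidate
    return None


# Hypothetical refs in the upstream repo after the rewrite.
refs = {
    "refs/heads/rawhide",
    "refs/heads/f37",
    "refs/archive/rawhide",
    "refs/archive/f37",
}

old_f37 = resolve("archive/f37", refs)  # matched by the "refs/{}" rule
new_f37 = resolve("f37", refs)          # matched by the "refs/heads/{}" rule
```

So the plain branch name keeps pointing at the rewritten branch, while the archive/ prefix selects the saved one.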
(EDIT: Note that this command is intended to be invoked in the "upstream" dist-git repo. When testing locally, after e.g. fedpkg clone, one has to first generate local branches, e.g. via for i in {14..37}; do git checkout f$i;done.)
PROPOSAL: FESCo approves rewriting the history in those repositories using git filter-repo, with the old branches saved to the refs/archive/ namespace.
Metadata Update from @zbyszek: - Issue untagged with: stalled
Fantastic!
Does this mean we'll need to tell people: "Just do a fresh checkout of the affected packages"?
+1
That is the easiest option and I think we should recommend that. Otherwise, you need to do a git pull (or git pull --rebase if the clone is old enough, because the default changed some time ago) in each branch before using it.
After two weeks and some change, the result is: APPROVED (+4, 0, 0)
Metadata Update from @zbyszek: - Issue tagged with: pending announcement
Announced: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/VFEB3T7O24UOPB6M3H5DFDOJS36PXEGW/.
Metadata Update from @zbyszek: - Issue untagged with: pending announcement - Issue close_status updated to: Accepted - Issue status updated to: Closed (was: Open)