I'm not sure how this can happen:
```
Errors during downloading metadata for repository 'http_kojipkgs_fedoraproject_org_repos_rawhide_latest_basearch':
  - Status code: 404 for https://kojipkgs.fedoraproject.org/repos/rawhide/latest/x86_64/repodata/d5852081fe17f07cdbeae5e4b6f11cae2e760f83c5c011769e10b9b2842c9e63-filelists.xml.gz (IP: 38.145.60.20)
  - Status code: 404 for https://kojipkgs.fedoraproject.org/repos/rawhide/latest/x86_64/repodata/4a9333dbad0f011c9b5ce60fee4bddc71c1558a320a910ed9753fb604be49c98-primary.xml.gz (IP: 38.145.60.20)
Error: Failed to download metadata for repo 'http_kojipkgs_fedoraproject_org_repos_rawhide_latest_basearch': Yum repo downloading error: Downloading error(s): repodata/4a9333dbad0f011c9b5ce60fee4bddc71c1558a320a910ed9753fb604be49c98-primary.xml.gz - Cannot download, all mirrors were already tried without success; repodata/d5852081fe17f07cdbeae5e4b6f11cae2e760f83c5c011769e10b9b2842c9e63-filelists.xml.gz - Cannot download, all mirrors were already tried without success
WARNING: Dnf command failed, retrying, attempt #2, sleeping 10s ...
```
Copr tries 3 times (about a minute) and then fails. Any ideas? It looks like the `repomd.xml` file is served for a longer time period and still points to outdated metadata files that have already been removed. But that would be weird... perhaps caching?
Note that this repodata is regenerated all the time by kojira, whenever anything changes in the buildroot.
If it gets the repomd.xml and then tries to fetch the other repodata files it references, the repodata might have changed in the meantime, and it would need to retry by re-reading repomd.xml.
Not sure how to handle this any better. ;(
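If DNF (or a wrapper around it) were to handle this itself, it would roughly be a loop like the following. A minimal sketch, assuming plain HTTP fetches; the URL and helper names are made up for illustration and this is not DNF's actual code:

```python
import time
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical repo URL, matching the failing one from the log above.
REPO = "https://kojipkgs.fedoraproject.org/repos/rawhide/latest/x86_64"

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def referenced_paths(repomd_xml):
    """Extract the relative metadata paths (location/@href) from repomd.xml."""
    ns = {"repo": "http://linux.duke.edu/metadata/repo"}
    root = ET.fromstring(repomd_xml)
    return [loc.get("href") for loc in root.iterfind(".//repo:location", ns)]

def download_repodata(retries=3, sleep=10):
    for _ in range(retries):
        repomd = fetch(f"{REPO}/repodata/repomd.xml")
        try:
            return {p: fetch(f"{REPO}/{p}") for p in referenced_paths(repomd)}
        except urllib.error.HTTPError as err:
            if err.code != 404:
                raise
            # A referenced file vanished: the repo was probably regenerated.
            # Loop back and re-read repomd.xml instead of retrying old URLs.
            time.sleep(sleep)
    raise RuntimeError("repodata kept changing underneath us")
```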
The old repodata is kept in https://kojipkgs.fedoraproject.org/repos/f36-build/, i.e., it moves the existing one, makes a new one, and points 'latest' to it. Perhaps something could be done with that?
But mock should handle this; that's why this is weird. We fully restart the DNF install process from scratch.
The way createrepo should work is that the repomd.xml file is created as the last one, so whenever this file changes we can be sure that the other referenced metadata files are available as well (or at least there should be a very short race while files are moved from a temp dir).
Yet we seem to get a stale repomd.xml file for a longer time period.
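A minimal sketch of that publish ordering (an illustration of the argument above, not createrepo's actual code; it assumes the staging dir sits on the same filesystem, so the final rename is atomic):

```python
import os
import shutil

def publish_repodata(staging_dir, repodata_dir):
    """Move generated metadata into place, repomd.xml strictly last."""
    for name in os.listdir(staging_dir):
        if name != "repomd.xml":
            shutil.move(os.path.join(staging_dir, name),
                        os.path.join(repodata_dir, name))
    # repomd.xml goes last: once a reader sees the new repomd.xml, every
    # file it references is already in place.  os.replace() is an atomic
    # rename on POSIX when source and destination share a filesystem.
    os.replace(os.path.join(staging_dir, "repomd.xml"),
               os.path.join(repodata_dir, "repomd.xml"))
```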
Can you please elaborate on the "Moves the existing one" part? Is that hardlinked?
> Perhaps something could be done with that?
I believe the symlink is an `ln -sf` action, and it is done as the "last" action in the chain of related actions (to minimize race conditions). Therefore, dunno... I would bet that some caching goes against us... is there some? If so, is it fully disabled for the `repomd.xml` URLs?
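For what it's worth, as far as I know `ln -sf` removes and recreates the link in two steps, so even the "last action" approach leaves a tiny window with no `latest` link at all. A rename-based flip would close that window too; a sketch, not what kojira actually does:

```python
import os

def flip_latest(repos_dir, new_repo_id):
    """Atomically repoint 'latest' at a new repo directory."""
    tmp = os.path.join(repos_dir, ".latest.tmp")
    if os.path.lexists(tmp):
        os.unlink(tmp)
    # The link target is relative, e.g. "1234567" next to "latest".
    os.symlink(str(new_repo_id), tmp)
    # rename(2) replaces the old symlink atomically; readers always see
    # either the old target or the new one, never a missing link.
    os.replace(tmp, os.path.join(repos_dir, "latest"))
```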
Using the `latest` symlink is prone to a race condition: the symlink can be changed underneath while the repo is being downloaded, which can lead not only to failures when DNF loads the repo. But more importantly, builds for different architectures can be done against different repos (there may be a very significant delay between the times builds for different arches are run by Copr), which can lead to subtle differences between the same package built for different arches. To avoid this race condition you can first get the ID of the latest repo for a particular tag (with a call like `koji call getRepo f36-build`) and then use that repo ID instead of `latest`. Koschei and Koji itself always refer to repos by specifying the repo ID, never by the `latest` symlink. They download Koji repos very frequently and I haven't seen any issue with repodata caching.
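The same lookup through the koji Python API would look roughly like this; a sketch, where the hub URL and the repo URL layout are assumptions based on the paths quoted in this ticket:

```python
import koji  # provided by the 'koji' client package

# Equivalent of `koji call getRepo f36-build` from the suggestion above.
session = koji.ClientSession("https://koji.fedoraproject.org/kojihub")
repo = session.getRepo("f36-build")  # dict describing the newest repo

# Pin the URL to the repo ID instead of the 'latest' symlink, so every
# architecture of a build resolves to exactly the same repodata.
repo_url = f"https://kojipkgs.fedoraproject.org/repos/f36-build/{repo['id']}/x86_64/"
print(repo_url)
```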
> But more importantly, builds for different architectures can be done against different repos (there may be a very significant delay between the times builds for different arches are run by Copr), which can lead to subtle differences between the same package built for different arches.
Well, the symbolic link is at the upper level though, so it should be arch-agnostic?
I understand that the results might be different; all the metadata change between architectures (when one arch is done later than the other). But this is something that we can tolerate in Copr.
> To avoid this race condition you can first get the ID of the latest repo for a particular tag (with a call like `koji call getRepo f36-build`) and then use that repo ID instead of `latest`.
I don't think we want to complicate the repo consumption in our logic. :-/ We don't want to close this race.
The problem I describe now is that, for a non-trivial amount of time, we face an inconsistency in the repodata (again, Copr retries several times and fails repeatedly, while it should get the updated repodata immediately on the second attempt).
I mean, we would probably want to fix this as well, but I'm not sure it is worth it. Copr is close to Mock usage, and mock is what users usually run locally... that is, they tolerate the `--enablerepo=local` consequences.
OTOH the problem we want to solve seems to be much simpler, yet it isn't obvious where the problem is.
Metadata Update from @mohanboddu:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: low-gain, low-trouble, ops
Well, I don't know what's happening here exactly, but I suspect the repo gets regenerated between the time dnf fetches repomd.xml and the time it fetches the files it references.
So, I think we need dnf to also retry the repomd.xml in this case? Or check that it hasn't changed?
Otherwise, using koji to see what the repo is and calling it by its non-latest version should work, but that is more complex of course.
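The "check that it hasn't changed" part could be scripted as a debugging aid: fetch repomd.xml on every attempt and compare hashes. If the content never changes while the referenced files keep returning 404, a cache replaying a stale copy is the likely culprit. A sketch, with the URL and timing only as examples:

```python
import hashlib
import time
import urllib.request

URL = ("https://kojipkgs.fedoraproject.org/repos/"
       "rawhide/latest/x86_64/repodata/repomd.xml")

digests = []
for attempt in range(3):
    with urllib.request.urlopen(URL) as resp:
        digest = hashlib.sha256(resp.read()).hexdigest()
    digests.append(digest)
    print(f"attempt {attempt + 1}: repomd.xml sha256 {digest[:16]}...")
    time.sleep(10)

if len(set(digests)) == 1:
    print("repomd.xml never changed; suspect caching in front of kojipkgs")
```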
> So, I think we need dnf to also retry the repomd.xml in this case?
We already do this. We restart the whole DNF process from scratch. That's why I don't get why this error can actually happen.
Huh, then I am puzzled how this is getting triggered.
Would it be possible to add some debugging? I.e., when this happens, grab an index of that latest directory so we can see what's there? Whether it's just that the hashes changed, or it somehow can't reach kojipkgs?
Meh, I'm blind, but an example of such a build is here:
https://download.copr.fedorainfracloud.org/results/%40python/python3.11/fedora-rawhide-x86_64/03329977-plplot/
https://download.copr.fedorainfracloud.org/results/%40python/python3.11/fedora-rawhide-x86_64/03329977-plplot/chroot_scan/var/lib/mock/fedora-rawhide-x86_64-1644133691.331359/root/var/log/
(we turn on debugging for all builds)
It really looks like a stale repomd.xml is repeatedly returned to the client (because of caches?)... This isn't a problem for the `<id>`-based URLs, but for `latest/` it could be.
Varnish seems to identify itself in the responses :-/ Perhaps we could set some no-cache header for the repomd.xml files (only). A very similar thing is done on Copr Backend: https://pagure.io/fedora-infra/ansible/blob/3186e413d691dfd581344e2e6bea53741bd3d30d/f/roles/copr/backend/templates/lighttpd/lighttpd.conf#_541-543
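The caching suspicion can also be checked from the client side. A sketch; the headers listed are ones Varnish commonly emits ("Via", "Age"), plus "X-Cache", which is only present if the VCL adds it, so treat missing headers as inconclusive:

```python
import urllib.request

URL = ("https://kojipkgs.fedoraproject.org/repos/"
       "rawhide/latest/x86_64/repodata/repomd.xml")

req = urllib.request.Request(URL, method="HEAD")
with urllib.request.urlopen(req) as resp:
    # An "Age" header greater than 0 means a cached copy was served.
    for header in ("Via", "Age", "X-Cache", "Cache-Control"):
        print(f"{header}: {resp.headers.get(header)}")
```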
Ah indeed, kojipkgs uses a Varnish server.
I thought we did have repodata excluded at one point, but I guess that got dropped somewhere.
`roles/varnish/templates/kojipkgs.vcl.j2` is the file if you want to propose a PR; otherwise I will try and take a look soon.
Hm, I'm unable to file a PR... Pagure is having a bad day :-(
https://pagure.io/fedora-infra/ansible/pull-request/968
Commit 8caaee2b fixes this issue