#10114 Prod ostree repo missing content
Closed: Fixed 2 years ago by kevin. Opened 3 years ago by jlebon.

  • Describe the issue

The ostree repo at https://kojipkgs.fedoraproject.org/ostree/repo/ (backing https://ostree.fedoraproject.org) appears to be missing content. For example, the config file is gone, and from a light perusal, at least the subdirs objects/0a/, objects/0c/, and objects/89/ are missing.
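A quick way to confirm is to request the config and one of the affected objects directly (both were 404ing at the time; the object path below is derived from the hash in the pull log further down):

$ curl -sI https://kojipkgs.fedoraproject.org/ostree/repo/config | head -n1
$ curl -sI https://kojipkgs.fedoraproject.org/ostree/repo/objects/0a/d4b715134f5e414b60f270b0a5b58999be6bffa483905a9812e4c9450ad3f0.file | head -n1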

This is causing OSTree pull failures like:

$ G_MESSAGES_DEBUG=all ostree pull fedora:fedora/34/x86_64/silverblue
...
(ostree pull:132813): OSTree-DEBUG: 18:07:30.493: starting fetch of 0ad4b715134f5e414b60f270b0a5b58999be6bffa483905a9812e4c9450ad3f0.file
(ostree pull:132813): OSTree-DEBUG: 18:07:30.493: starting fetch of 897d9263cef121894aef9b0578a72935b81af9498169a0e5a04d6c1b5098e257.file
(ostree pull:132813): OSTree-DEBUG: 18:07:30.494: starting fetch of 0cc780539dceb070c6aed1c8f05737955b82fbd1507bab59893ab3c8a1bb129c.file
(ostree pull:132813): OSTree-DEBUG: 18:07:30.590: _ostree_fetcher_should_retry_request: error: 125:1 Server returned HTTP 404, n_retries_remaining: 5
(ostree pull:132813): OSTree-DEBUG: 18:07:30.590: Request caught error: Server returned HTTP 404
(ostree pull:132813): OSTree-DEBUG: 18:07:30.621: _ostree_fetcher_should_retry_request: error: 125:1 Server returned HTTP 404, n_retries_remaining: 5
(ostree pull:132813): OSTree-DEBUG: 18:07:30.621: Request caught error: Server returned HTTP 404
(ostree pull:132813): OSTree-DEBUG: 18:07:30.682: _ostree_fetcher_should_retry_request: error: 125:1 Server returned HTTP 404, n_retries_remaining: 5
(ostree pull:132813): OSTree-DEBUG: 18:07:30.682: Request caught error: Server returned HTTP 404
(ostree pull:132813): OSTree-DEBUG: 18:07:30.682: pull: idle, exiting mainloop
...
error: Server returned HTTP 404

This affects all ostree variants, so we need to fix it promptly and determine what actually removed those files.


After some investigation we are restoring content from a snapshot.

We have also disabled updates-sync and the ostree-pruner; these are the only two scripts that touch that repo.

This may be related to the bodhi-backend01 upgrade/reinstall earlier today, but it's unclear what might actually have caused this.

I don't think this is the updates-sync script. It only does ostree pull-local and ostree summary --update, neither of which should trigger a prune.
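For reference, the ostree side of updates-sync boils down to something like this (paths and ref are hypothetical, not the script's literal invocation):

# Copy new commits from the compose repo into the prod repo.
$ ostree --repo=/path/to/prod/repo pull-local /path/to/compose/repo fedora/34/x86_64/silverblue
# Regenerate the summary file so clients see the new commits.
$ ostree --repo=/path/to/prod/repo summary --update

Both operations only add objects or rewrite metadata; neither deletes anything.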

As to the ostree pruner, AFAICT it's not actually fully online yet. It seems like the pod just sleeps for now, and @dustymabe's PR to add it is still pending (https://github.com/coreos/fedora-coreos-releng-automation/pull/79).

So, I suspect there might've been something else going on here. Based on the fact that whole subdirs of the objects/ dir are gone, I'd say it was something not ostree-aware.
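For contrast, an ostree-aware prune works per object, removing individual unreachable files under objects/ rather than whole two-hex-character subdirectories; roughly (path and retention policy illustrative):

# Delete objects unreachable from any ref, keeping recent commits.
$ ostree --repo=/path/to/prod/repo prune --refs-only --keep-younger-than="90 days ago"

Losing objects/0a/, objects/0c/, and objects/89/ wholesale doesn't match that pattern.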

If need be, for coreos we can pretty easily recover the objects from our cached ostree tarballs. For example, https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/33.20210426.3.0/x86_64/fedora-coreos-33.20210426.3.0-ostree.x86_64.tar is a tarball of an archive-mode repo of the latest stable update. So, something like this:

$ mkdir repo
$ cd repo
$ curl -L https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/33.20210426.3.0/x86_64/fedora-coreos-33.20210426.3.0-ostree.x86_64.tar | tar xf -
$ ostree --repo=/path/to/prod/repo pull-local .
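Note that pull-local only copies objects that are missing from the destination repo, so running it against the full prod repo should be safe to repeat.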

AFAIK IoT and Silverblue do not have this, but in the end forcing a new build (e.g. with --force-nocache; I think there's a way to do that with pungi) will add things back, and clients will be able to update to it fine (history doesn't matter for updates).
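For reference, the flag itself is rpm-ostree's; a forced recompose would look roughly like this (treefile name hypothetical, and in practice pungi would be driving it):

# Always create a new commit, even if no inputs appear to have changed.
$ rpm-ostree compose tree --force-nocache --repo=/path/to/prod/repo fedora-silverblue.yaml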

Seems to be resolved at least for SB.

The only change I can think of is that I recently re-deployed the coreos-ostree-importer, because I updated the container image to f34 and changed the branch names for the upstream repo to main.

The importer itself doesn't prune anything, though, so I would be surprised if that was the source, unless there is a serious bug in F34 or a newer rpm-ostree that is causing data to get deleted.

So, status update:
- We have completed our restore from snapshot. The repo should be fine again as of 4-6 hours ago.
- We have manually run the updates-sync script; it worked fine, with no unexpected output or errors.
- I have now re-enabled updates pushes; they should go out soon.
- Once updates are all pushed out, I am going to start an ostree fsck on the prod repo (see the sketch below).
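For reference, the check itself is just (path hypothetical):

# Verify checksums of all objects reachable from the repo's commits.
$ ostree --repo=/path/to/prod/repo fsck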

We are still at a loss as to the root cause here. I can't think of anything that would just delete part of the ostree repo and nothing else. ;(

@humaton / @asaleh can you think of anything at all that could have caused this during the bodhi-backend01 re-install? I mean, the ftpsync user had the wrong uid, but that would just cause permission-denied errors, I would think; there's no way it could delete anything. Very puzzling. ;(

Metadata Update from @smooge:
- Issue tagged with: high-gain, high-trouble, ops

3 years ago

So, the ostree fsck took weeks, but it did finish without errors. I guess we are all ok now...

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Thanks for checking this!

