#10114 Prod ostree repo missing content
Closed: Fixed 2 years ago by kevin. Opened 3 years ago by jlebon.

  • Describe the issue

The ostree repo at https://kojipkgs.fedoraproject.org/ostree/repo/ (backing https://ostree.fedoraproject.org) appears to be missing content. For example, the config file is gone, and from a light perusal, at least the subdirs objects/0a/, objects/0c/, and objects/89/ are missing.
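A quick way to confirm is to request the config and one of the affected objects directly (both were 404ing at the time; the object path below is derived from the hash in the pull log further down):

$ curl -sI https://kojipkgs.fedoraproject.org/ostree/repo/config | head -n1
$ curl -sI https://kojipkgs.fedoraproject.org/ostree/repo/objects/0a/d4b715134f5e414b60f270b0a5b58999be6bffa483905a9812e4c9450ad3f0.file | head -n1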

This is causing OSTree pull failures like:

$ G_MESSAGES_DEBUG=all ostree pull fedora:fedora/34/x86_64/silverblue
...
(ostree pull:132813): OSTree-DEBUG: 18:07:30.493: starting fetch of 0ad4b715134f5e414b60f270b0a5b58999be6bffa483905a9812e4c9450ad3f0.file
(ostree pull:132813): OSTree-DEBUG: 18:07:30.493: starting fetch of 897d9263cef121894aef9b0578a72935b81af9498169a0e5a04d6c1b5098e257.file
(ostree pull:132813): OSTree-DEBUG: 18:07:30.494: starting fetch of 0cc780539dceb070c6aed1c8f05737955b82fbd1507bab59893ab3c8a1bb129c.file
(ostree pull:132813): OSTree-DEBUG: 18:07:30.590: _ostree_fetcher_should_retry_request: error: 125:1 Server returned HTTP 404, n_retries_remaining: 5
(ostree pull:132813): OSTree-DEBUG: 18:07:30.590: Request caught error: Server returned HTTP 404
(ostree pull:132813): OSTree-DEBUG: 18:07:30.621: _ostree_fetcher_should_retry_request: error: 125:1 Server returned HTTP 404, n_retries_remaining: 5
(ostree pull:132813): OSTree-DEBUG: 18:07:30.621: Request caught error: Server returned HTTP 404
(ostree pull:132813): OSTree-DEBUG: 18:07:30.682: _ostree_fetcher_should_retry_request: error: 125:1 Server returned HTTP 404, n_retries_remaining: 5
(ostree pull:132813): OSTree-DEBUG: 18:07:30.682: Request caught error: Server returned HTTP 404
(ostree pull:132813): OSTree-DEBUG: 18:07:30.682: pull: idle, exiting mainloop
...
error: Server returned HTTP 404

This affects all ostree variants, so we need to fix it promptly and determine what actually removed those files.


After some investigation we are restoring content from a snapshot.

We have also disabled updates-sync and the ostree-pruner; these are the only two scripts that touch that repo.

This may be related to the bodhi-backend01 upgrade/reinstall earlier today, but it's unclear what might actually have caused this.

I don't think this is the updates-sync script. It only does ostree pull-local and ostree summary --update, neither of which should trigger a prune.
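For reference, the ostree side of updates-sync boils down to something like this (paths and ref are hypothetical, not the script's literal invocation):

# Copy new commits from the compose repo into the prod repo.
$ ostree --repo=/path/to/prod/repo pull-local /path/to/compose/repo fedora/34/x86_64/silverblue
# Regenerate the summary file so clients see the new commits.
$ ostree --repo=/path/to/prod/repo summary --update

Both operations only add objects or rewrite metadata; neither deletes anything.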

As to the ostree pruner, AFAICT it's not actually fully online yet. It seems like the pod just sleeps for now, and @dustymabe's PR to add it is still pending (https://github.com/coreos/fedora-coreos-releng-automation/pull/79).

So, I suspect there might've been something else going on here. Based on the fact that whole subdirs of the objects/ dir are gone, I'd say it was something not ostree-aware.
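For contrast, an ostree-aware prune works per object, removing individual unreachable files under objects/ rather than whole two-hex-character subdirectories; roughly (path and retention policy illustrative):

# Delete objects unreachable from any ref, keeping recent commits.
$ ostree --repo=/path/to/prod/repo prune --refs-only --keep-younger-than="90 days ago"

Losing objects/0a/, objects/0c/, and objects/89/ wholesale doesn't match that pattern.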

If need be, for coreos we can pretty easily recover the objects from our cached ostree tarballs. For example, https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/33.20210426.3.0/x86_64/fedora-coreos-33.20210426.3.0-ostree.x86_64.tar is a tarball of an archive-mode repo of the latest stable update. So, something like this:

$ mkdir repo
$ cd repo
$ curl -L https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/33.20210426.3.0/x86_64/fedora-coreos-33.20210426.3.0-ostree.x86_64.tar | tar xf -
$ ostree --repo=/path/to/prod/repo pull-local .
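Note that pull-local only copies objects that are missing from the destination repo, so running it against the full prod repo should be safe to repeat.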

AFAIK IoT and Silverblue do not have this, but in the end forcing a new build (e.g. with --force-nocache; I think there's a way to do that with pungi) will add things back, and clients will be able to update to it fine (history doesn't matter for updates).
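For reference, the flag itself is rpm-ostree's; a forced recompose would look roughly like this (treefile name hypothetical, and in practice pungi would be driving it):

# Always create a new commit, even if no inputs appear to have changed.
$ rpm-ostree compose tree --force-nocache --repo=/path/to/prod/repo fedora-silverblue.yaml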

Seems to be resolved at least for SB.

The only change I can think of is that I recently re-deployed the coreos-ostree-importer, because I updated the container image to f34 and changed the branch names for the upstream repo to main.

The importer itself doesn't prune anything, though, so I would be surprised if that was the source, unless there is a serious bug in F34 or a newer rpm-ostree that is causing data to get deleted.

So, status update:
- We have completed our restore from snapshot. The repo should be fine again as of 4-6 hours ago.
- We have manually run the updates-sync script; it worked fine, with no unexpected output or errors.
- I have now re-enabled updates pushes; they should go out soon.
- Once updates are all pushed out, I am going to start an ostree fsck on the prod repo (see the sketch below).
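For reference, the check itself is just (path hypothetical):

# Verify checksums of all objects reachable from the repo's commits.
$ ostree --repo=/path/to/prod/repo fsck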

We are still at a loss as to the root cause here. I can't think of anything that would just delete part of the ostree repo and nothing else. ;(

@humaton / @asaleh can you think of anything at all that could have caused this during the bodhi-backend01 re-install? I mean, the ftpsync user had the wrong uid, but that would just cause permission-denied errors, I would think; there's no way it could delete anything. Very puzzling. ;(

Metadata Update from @smooge:
- Issue tagged with: high-gain, high-trouble, ops

3 years ago

So, the ostree fsck took weeks, but it did finish without errors. I guess we are all ok now...

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Thanks for checking this!

