Issue #8180: openshift/stg: new rollouts get stuck (and then fail) with no logs - fedora-infrastructure


Closed: Fixed 4 years ago by kevin. Opened 4 years ago by lucab.

As per the subject, I'm unable to roll out new deployment revisions on the fedora-infra OpenShift staging cluster: they get stuck with no observable logs or events.

Specifically, I've observed the following on the coreos-cincinnati/coreos-cincinnati-stub deployment:
* rollout #45: started on 2019-09-05, successful and currently active
* rollout #47: started on 2019-09-08, failed. I can see neither logs nor events that would tell me why it failed or how long it took to fail.
* rollout #48: started on 2019-09-09, still pending scheduling (it has been waiting for more than 1h at the time of this ticket). I can see neither logs nor events that would tell me what it is waiting for.

Quick link: https://os.stg.fedoraproject.org/console/project/coreos-cincinnati/browse/dc/coreos-cincinnati-stub

cverna commented 4 years ago

Got more info with oc describe on the pod

  Normal   Pulling           1h (x3 over 1h)     kubelet, os-node04.stg.phx2.fedoraproject.org  pulling image "registry.redhat.io/openshift3/ose-deployer:v3.11.43"
  Warning  Failed            1h (x3 over 1h)     kubelet, os-node04.stg.phx2.fedoraproject.org  Failed to pull image "registry.redhat.io/openshift3/ose-deployer:v3.11.43": rpc error: code = Unknown desc = pinging docker registry returned: Get https://registry.redhat.io/v2/: dial tcp: lookup registry.redhat.io on server misbehaving
  Normal   BackOff           1h (x4 over 1h)     kubelet, os-node04.stg.phx2.fedoraproject.org  Back-off pulling image "registry.redhat.io/openshift3/ose-deployer:v3.11.43"
  Warning  Failed            1h (x4 over 1h)     kubelet, os-node04.stg.phx2.fedoraproject.org  Error: ImagePullBackOff
  Warning  Failed            20m (x17 over 1h)   kubelet, os-node04.stg.phx2.fedoraproject.org  Error: ErrImagePull

I think this is because registry.redhat.io is now registry.access.redhat.com. I'll wait for @kevin to be around, since he set up the cluster, before trying to make any changes.
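
For anyone else hitting a rollout that fails without visible logs, events like the ones above can be pulled from the CLI roughly as follows. This is only a sketch: it assumes the usual OpenShift 3.x naming of deployer pods (`<dc>-<revision>-deploy`), and the helper names `deployer_pod` and `show_rollout_events` are made up for illustration.

```shell
#!/bin/sh
# Print the deployer pod name for a DeploymentConfig revision.
# Assumes the OpenShift 3.x convention "<dc>-<revision>-deploy".
deployer_pod() {
    printf '%s-%s-deploy\n' "$1" "$2"
}

# Show pod details and recent events for a stuck rollout.
# Arguments: namespace, DeploymentConfig name, revision number.
show_rollout_events() {
    ns=$1
    pod=$(deployer_pod "$2" "$3")
    oc -n "$ns" describe pod "$pod"
    oc -n "$ns" get events --sort-by=.lastTimestamp
}

# Example (needs a logged-in oc session):
#   show_rollout_events coreos-cincinnati coreos-cincinnati-stub 48
```

The image pull errors show up in the Events section at the end of the `oc describe` output, which is where the lines quoted above came from.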

Metadata Update from @cverna:
- Issue assigned to cverna
- Issue priority set to: Waiting on Assignee (was: Needs Review)

4 years ago

walters commented 4 years ago

I think this is because registry.redhat.io is now registry.access.redhat.com

Hmm, no, it's more the reverse: registry.redhat.io is the default and requires sign-in/terms acceptance. However, registry.access.redhat.com will remain because (AIUI):

* We want unauthenticated pulls of registry.access.redhat.com/ubi8/ubi:latest
* We need a really long deprecation period anyways

That said...this looks like DNS in the cluster, not a problem with upstream.

kevin commented 4 years ago

Yes, this is DNS. The problem is that when we resolve registry.redhat.io internally, we get an internal address that doesn't work due to firewalls between the Fedora and Red Hat networks.

So we override it via /etc/hosts, which works fine until Akamai or whoever moves IPs around, and then the one we were using stops working. ;(

For now I will put in a freeze break to fix the IP, but we should come up with a better permanent solution.
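
To check whether the address pinned in /etc/hosts still serves the registry, something like the following could be run on a node. It is only a sketch: the function names are hypothetical, and it assumes curl is installed and that any HTTP response from /v2/ (even an authentication error) means the IP is still usable.

```shell
#!/bin/sh
# Print the first IP pinned for a hostname in a hosts-format file.
# Arguments: hostname, hosts file (defaults to /etc/hosts).
pinned_ip() {
    awk -v host="$1" '
        { sub(/#.*/, "") }                       # drop comments
        { for (i = 2; i <= NF; i++)
              if ($i == host) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# Return success if the given IP answers HTTPS for the hostname.
# curl --resolve forces the name to that IP, bypassing DNS.
ip_serves_host() {
    curl --silent --output /dev/null --max-time 10 \
         --resolve "$1:443:$2" "https://$1/v2/"
}

# Example:
#   ip=$(pinned_ip registry.redhat.io)
#   ip_serves_host registry.redhat.io "$ip" || echo "pinned IP $ip is stale"
```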

dustymabe commented 4 years ago

Could we just run dig and specify a public DNS server and detect when it changes in a cron job?

$ dig +short registry.redhat.io
registry.redhat.io.edgekey.net.
e14353.g.akamaiedge.net.
23.196.120.110
$ dig @8.8.8.8 +short registry.redhat.io
registry.redhat.io.edgekey.net.
e14353.g.akamaiedge.net.
184.31.48.44
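
A rough sketch of that check, for illustration only: it assumes dig is available, that 8.8.8.8 is an acceptable public resolver, and that the pinned entry lives in /etc/hosts. The function names are invented here. Akamai may return different addresses on successive queries, so a mismatch is a hint rather than proof that the pinned IP is broken.

```shell
#!/bin/sh
# Keep only IPv4 addresses from `dig +short` output (drops CNAME lines).
ipv4_only() {
    grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}

# Print the IP pinned for a hostname in a hosts-format file.
# Arguments: hostname, hosts file.
hosts_ip() {
    awk -v host="$1" '
        $1 !~ /^#/ { for (i = 2; i <= NF; i++)
                         if ($i == host) { print $1; exit } }
    ' "$2"
}

# Succeed if the pinned IP is among the addresses public DNS returns;
# otherwise print a warning to stderr and fail.
check_pin() {
    host=$1
    hosts_file=${2:-/etc/hosts}
    pinned=$(hosts_ip "$host" "$hosts_file")
    public=$(dig @8.8.8.8 +short "$host" | ipv4_only)
    if [ -n "$pinned" ] && printf '%s\n' "$public" | grep -qxF -- "$pinned"; then
        return 0
    fi
    echo "$host: pinned '$pinned', public DNS now returns: $public" >&2
    return 1
}

# Example, e.g. from a cron job:
#   check_pin registry.redhat.io
```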

kevin commented 4 years ago

Should be fixed now.

:rooster:

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

kevin commented 4 years ago

Could we just run dig and specify a public DNS server and detect when it changes in a cron job?

$ dig +short registry.redhat.io
registry.redhat.io.edgekey.net.
e14353.g.akamaiedge.net.
23.196.120.110
$ dig @8.8.8.8 +short registry.redhat.io
registry.redhat.io.edgekey.net.
e14353.g.akamaiedge.net.
184.31.48.44

I really dislike the complexity that would bring...

* The cron job has to run somewhere (and needs to always get an answer/check the results)
* It would have to commit to Ansible or have access to /etc/hosts on all the OpenShift nodes

I think it's likely better to see if we can get our nameservers to forward this domain differently.
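
If the resolver in question were dnsmasq (an assumption; the thread does not say which nameserver software is involved), forwarding just this name to a public resolver might look like the fragment below. The file name and the choice of 8.8.8.8 are placeholders, not the actual Fedora infrastructure configuration.

```
# /etc/dnsmasq.d/registry-redhat-io.conf (hypothetical file name)
# Send queries for registry.redhat.io to a public resolver instead of
# the internal nameservers.
server=/registry.redhat.io/8.8.8.8
```

Other nameservers have their own equivalents (per-domain forwarding), so the same idea would need translating to whatever the internal servers run.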

dustymabe commented 4 years ago

Can I get access to the project to check it out?

I don't know what I need to do to get access, but for reference: https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=11c0f4865e457d12e0d030beea5c65f234a94f61.

So the question is: without deleting/recreating the project how do I get access to the project (i.e. an appowner)?

Edited 4 years ago by dustymabe