As per subject, I'm unable to rollout new deployment revisions on the fedora-infra OpenShift staging cluster, as they get stuck with no observable logs/events.
Specifically, I've observed the following on the coreos-cincinnati/coreos-cincinnati-stub deployment:
* rollout #45: started on 2019-09-05; successful and currently active
* rollout #47: started on 2019-09-08; failed. I can see no logs or events to tell why it failed or how long it took to fail.
* rollout #48: started on 2019-09-09; still pending scheduling (waiting for more than 1h at the time of this ticket). I can see no logs or events to tell what it is waiting for.
Quick link: https://os.stg.fedoraproject.org/console/project/coreos-cincinnati/browse/dc/coreos-cincinnati-stub
Got more info with `oc describe` on the pod:
```
Normal   Pulling  1h (x3 over 1h)   kubelet, os-node04.stg.phx2.fedoraproject.org  pulling image "registry.redhat.io/openshift3/ose-deployer:v3.11.43"
Warning  Failed   1h (x3 over 1h)   kubelet, os-node04.stg.phx2.fedoraproject.org  Failed to pull image "registry.redhat.io/openshift3/ose-deployer:v3.11.43": rpc error: code = Unknown desc = pinging docker registry returned: Get https://registry.redhat.io/v2/: dial tcp: lookup registry.redhat.io on server misbehaving
Normal   BackOff  1h (x4 over 1h)   kubelet, os-node04.stg.phx2.fedoraproject.org  Back-off pulling image "registry.redhat.io/openshift3/ose-deployer:v3.11.43"
Warning  Failed   1h (x4 over 1h)   kubelet, os-node04.stg.phx2.fedoraproject.org  Error: ImagePullBackOff
Warning  Failed   20m (x17 over 1h) kubelet, os-node04.stg.phx2.fedoraproject.org  Error: ErrImagePull
```
I think this is because registry.redhat.io is now registry.access.redhat.com. I'll wait for @kevin to be around, since he set up the cluster, before trying to make any changes.
Metadata Update from @cverna: - Issue assigned to cverna - Issue priority set to: Waiting on Assignee (was: Needs Review)
> I think this is because registry.redhat.io is now registry.access.redhat.com
Hmm, no, it's more the reverse: registry.redhat.io is the default and requires sign-in/terms acceptance. However, registry.access.redhat.com will remain (AIUI).
That said...this looks like DNS in the cluster, not a problem with upstream.
Yes, this is DNS. The problem is that internally, when we resolve registry.redhat.io, we get an internal address that doesn't work due to firewalls between the Fedora and Red Hat networks.

So we override it via /etc/hosts, which works fine until Akamai (or whoever) moves IPs around, and then the one we were using stops working. ;(

For now I will put in a freeze break to fix the IP, but we should come up with a better permanent solution.
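For context, the current workaround amounts to pinning the name in /etc/hosts on each node. A minimal sketch of that (the function name, file handling, and the specific IP are illustrative, not the actual ansible change):

```shell
#!/bin/sh
# Hypothetical helper: replace any existing /etc/hosts pin for a name
# with a fresh IP. This is the shape of the workaround described above;
# the pinned IP goes stale whenever Akamai rotates edge addresses.
update_hosts_pin() {  # update_hosts_pin <hosts-file> <ip> <name>
    file=$1; ip=$2; name=$3
    # drop any existing line ending in the name, then append the new pin
    grep -v " $name\$" "$file" > "$file.tmp"
    printf '%s %s\n' "$ip" "$name" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example (IP taken from the dig output later in this ticket):
# update_hosts_pin /etc/hosts 23.196.120.110 registry.redhat.io
```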
Could we just run dig and specify a public DNS server and detect when it changes in a cron job?
```
$ dig +short registry.redhat.io
registry.redhat.io.edgekey.net.
e14353.g.akamaiedge.net.
23.196.120.110

$ dig @8.8.8.8 +short registry.redhat.io
registry.redhat.io.edgekey.net.
e14353.g.akamaiedge.net.
184.31.48.44
```
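The cron-job idea above could look roughly like this. A sketch only: the function names are made up, the resolver addresses are examples, and actually fixing /etc/hosts on drift is left out:

```shell
#!/bin/sh
# Hypothetical drift check: compare what the local resolver returns for
# registry.redhat.io against a public resolver, and flag a mismatch.
# Assumes dig (bind-utils) is installed.

lookup() {  # lookup <server> <name> -> last A record in the answer
    dig "@$1" +short "$2" | tail -n1
}

ips_differ() {  # succeed only when both IPs are non-empty and differ
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Example invocation, as it might run from cron:
# local_ip=$(lookup 127.0.0.1 registry.redhat.io)
# public_ip=$(lookup 8.8.8.8 registry.redhat.io)
# if ips_differ "$local_ip" "$public_ip"; then
#     echo "registry.redhat.io moved: local=$local_ip public=$public_ip"
# fi
```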
Should be fixed now.
:rooster:
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
> Could we just run dig and specify a public DNS server and detect when it changes in a cron job?
>
> ```
> $ dig +short registry.redhat.io
> registry.redhat.io.edgekey.net.
> e14353.g.akamaiedge.net.
> 23.196.120.110
>
> $ dig @8.8.8.8 +short registry.redhat.io
> registry.redhat.io.edgekey.net.
> e14353.g.akamaiedge.net.
> 184.31.48.44
> ```
I really dislike the complexity that would bring...
* The cron job has to run somewhere, and needs to always get an answer and have its results checked.
* It would have to commit to ansible, or have access to /etc/hosts on all the OpenShift nodes.
I think it's likely better to see if we can get our nameservers to forward this domain differently.
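Forwarding just this name could look something like the BIND-style fragment below. This is only a sketch of the idea, not the actual infra nameserver config; the zone choice and forwarder addresses are placeholders:

```
// Hypothetical named.conf fragment: forward lookups for this one name
// to a public resolver, so clients get the external Akamai edge IP
// instead of the firewalled internal address.
zone "registry.redhat.io" {
    type forward;
    forward only;
    forwarders { 8.8.8.8; 8.8.4.4; };
};
```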
Can I get access to the project to check it out?
I don't know what I need to do to get access, but for reference: https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=11c0f4865e457d12e0d030beea5c65f234a94f61.
So the question is: without deleting/recreating the project how do I get access to the project (i.e. an appowner)?
@kevin thanks, new deployment rollout confirmed working.