#7180 candidate-registry.fedoraproject.org is unavailable
Closed: Fixed 5 years ago Opened 5 years ago by cverna.

  • Describe what you need us to do:
    I was trying to check which images we have on the candidate registry using https://candidate-registry.fedoraproject.org/v2/_catalog but it returns

  • When do you need this? (YYYY/MM/DD)
    ASAP

  • When is this no longer needed or useful? (YYYY/MM/DD)

  • If we cannot complete your request, what is the impact?
    All OSBS builds will fail


Looking at the inventory in ansible, I'm seeing:

docker-candidate-registry01.stg.phx2.fedoraproject.org
docker-candidate-registry01.phx2.fedoraproject.org
docker-candidate-registry01.stg.phx2.fedoraproject.org

The staging instance is reachable at https://candidate-registry.stg.fedoraproject.org/v2/_catalog and I could access it via ssh; the prod instance seems unreachable via ssh.

This might be related to @codeblock's work to deploy the new registry in production. I think they were renamed to oci-candidate-registry01.phx2.fedoraproject.org (looking at the commit history of the ansible repo):

https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=171c5c1054d81c0af92ca9b6d1ac804e85cf5353

Hm, there is a playbooks/groups/releng-compose.yml playbook that seems to do something related to the candidate-registry, but it contains hosts: releng-compose:releng-stg and neither of these groups includes this host... :(
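
For anyone wanting to double-check that kind of group membership, here is a minimal sketch. It assumes the pattern is a plain union of group names separated by ":" (it does not handle "&", "!" or wildcards), and the group contents below are hypothetical placeholders, not the real inventory.

```python
def hosts_for_pattern(pattern, groups):
    """Expand a simple 'a:b' union pattern into the set of member hosts."""
    hosts = set()
    for name in pattern.split(":"):
        # Unknown group names contribute no hosts.
        hosts.update(groups.get(name, []))
    return hosts


# Hypothetical group membership, for illustration only; the real lists
# live in the ansible inventory.
groups = {
    "releng-compose": ["compose-example01.phx2.fedoraproject.org"],
    "releng-stg": ["compose-example01.stg.phx2.fedoraproject.org"],
}

target = "docker-candidate-registry01.phx2.fedoraproject.org"
targeted = target in hosts_for_pattern("releng-compose:releng-stg", groups)
print(f"{target} targeted by playbook: {targeted}")
```

With the real group contents loaded into `groups`, the same check would show whether the playbook can ever touch the registry host.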

So here is what I tried:

I ran the playbooks/groups/oci-registry.yml playbook a few times and had a few issues with it:

  • gluster failed with:
TASK [gluster/consolidated : Configure Gluster volume.] **************************************************************
Wednesday 22 August 2018  09:39:02 +0000 (0:00:00.220)       0:08:29.999 ****** 
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: None
fatal: [docker-registry01.stg.phx2.fedoraproject.org]: FAILED! => {"changed": false, "msg": "error running gluster (/usr/sbin/gluster --mode=script volume add-brick registry docker-registry01.stg.phx2.fedoraproject.org:/srv/glusterfs/ docker-registry02.stg.phx2.fedoraproject.org:/srv/glusterfs/ force) command (rc=1): volume add-brick: failed: Brick: docker-registry01.stg.phx2.fedoraproject.org:/srv/glusterfs not available. Brick may be containing or be contained by an existing brick.\n"}
...
TASK [gluster/consolidated : Configure Gluster volume.] **************************************************************
Wednesday 22 August 2018  09:39:10 +0000 (0:00:00.179)       0:08:38.442 ****** 
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: None
fatal: [oci-registry02.phx2.fedoraproject.org]: FAILED! => {"changed": false, "msg": "error running gluster (/usr/sbin/gluster --mode=script volume add-brick registry oci-registry01.phx2.fedoraproject.org:/srv/glusterfs/ oci-registry02.phx2.fedoraproject.org:/srv/glusterfs/ force) command (rc=1): volume add-brick: failed: Brick: oci-registry02.phx2.fedoraproject.org:/srv/glusterfs not available. Brick may be containing or be contained by an existing brick.\n"}

It seems the ssh fingerprint kept on changing:

TASK [basessh : make sure there is no old ssh host key for the host still around]
....
changed: [oci-registry02.phx2.fedoraproject.org -> localhost] => (item=/root/.ssh/known_hosts)

(^ Happened at every run)

The authenticity of host 'oci-registry02.phx2.fedoraproject.org (xxxxx)' can't be established.
RSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)?

Host became unreachable:

fatal: [oci-registry02.phx2.fedoraproject.org]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"oci-registry02.phx2.fedoraproject.org\". Make sure this host can be reached over ssh", "unreachable": true}

End of the run:

oci-registry01.phx2.fedoraproject.org : ok=1    changed=0    unreachable=1    failed=0   
oci-registry02.phx2.fedoraproject.org : ok=123  changed=3    unreachable=1    failed=0   
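
A recap like the one above can be checked mechanically. This is a rough sketch that assumes the recap lines keep the "host : key=value ..." shape shown here; the regular expression and function name are my own, not part of ansible.

```python
import re

# Matches lines such as "host : ok=1 changed=0 unreachable=1 failed=0".
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(text):
    """Return {host: {counter: int}} for every recap line found in text."""
    results = {}
    for line in text.splitlines():
        match = RECAP_LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a recap line
        pairs = (item.split("=") for item in match["counters"].split())
        results[match["host"]] = {key: int(value) for key, value in pairs}
    return results


recap = """\
oci-registry01.phx2.fedoraproject.org : ok=1    changed=0    unreachable=1    failed=0
oci-registry02.phx2.fedoraproject.org : ok=123  changed=3    unreachable=1    failed=0
"""

unreachable = [h for h, c in parse_recap(recap).items() if c.get("unreachable")]
print(unreachable)
```

Both hosts end up in `unreachable`, matching what the run reported.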

To push out @cverna's patch, I ran the playbook:

playbooks/groups/proxies.yml -t haproxy

Which finished fine.

Except that now both of these URLs are unreachable :(
- https://candidate-registry.stg.fedoraproject.org/v2/_catalog
- https://candidate-registry.fedoraproject.org/v2/_catalog
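
To re-test both endpoints after each change, something like this sketch could be used. The `probe` helper and its `opener` parameter are hypothetical names of mine (the parameter only exists so the function can be exercised without network access); the URLs are the two listed above.

```python
import urllib.error
import urllib.request

CATALOG_URLS = [
    "https://candidate-registry.stg.fedoraproject.org/v2/_catalog",
    "https://candidate-registry.fedoraproject.org/v2/_catalog",
]


def probe(url, timeout=10, opener=urllib.request.urlopen):
    """Return (url, status): an HTTP status code, or an 'unreachable' string."""
    try:
        with opener(url, timeout=timeout) as response:
            return url, response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503 from the proxy).
        return url, exc.code
    except OSError as exc:
        # Covers URLError, timeouts, DNS failures and refused connections.
        return url, f"unreachable: {exc}"
```

Calling `probe(url)` for each entry in `CATALOG_URLS` gives a quick before/after comparison when pushing or reverting a patch.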

I'm considering either reverting @cverna's patch to see if that fixes the stg host or just waiting for someone more qualified than me to help sort this out.

Sorry I couldn't help further, I hope I didn't make too much of a mess :(

@kevin fixed this.

There was one issue with the openvpn certificate: https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=d84e1df and another in haproxy, since stg has not had the rename that prod did: https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=450230a

Thanks @kevin :)

Metadata Update from @pingou:
- Issue close_status updated to: Fixed

5 years ago
