retrace.fedoraproject.org had an expired cert. Running the playbook didn't change anything.
It looks like somehow the certbot files there got in a weird state.
I tried to get a new cert, but then I hit a letsencrypt limit:
There were too many requests of a given type :: Error creating new order :: too many certificates (5) already issued for this exact set of domains in the last 168 hours: retrace.fedoraproject.org,retrace03.rdu-cc.fedoraproject.org: see https://letsencrypt.org/docs/rate-limits/
So, I then just made one for retrace.fedoraproject.org to get past the limit.
After this limit expires we will want to get a new retrace03.rdu-cc.fedoraproject.org + retrace.fedoraproject.org one. Also, the playbook has some other issues, so it should be fixed up. :)
cc: @mgrabovs @praiskup
Weird, this was set up by @msuchy (a notification about the certificate expiration should be sent to Mirek in advance; not sure why this isn't happening). I would take a look myself, but I don't have permission to ssh there. I think someone has to take a look at what's in /var/log/letsencrypt and journalctl -u certbot-renew.service (IIRC certbot tries to renew the certificate a month before expiration).
In Copr we use the same role 'copr/certbot', and it seems to work fine at least there.
@msrb, FYI
I'll look into it.
Metadata Update from @zlopez: - Issue tagged with: copr
Metadata Update from @mohanboddu: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: low-gain, low-trouble, ops
Yeah, I think something was corrupt with the local certs store or something. I wish I had saved it off before I started messing with it. Basically it showed 2 expired certs, then when I renewed it said it did, but errored and the 2 expired certs were still the only ones showing in certbot. So, I deleted those and got a new cert, but by then I hit that limit. ;( So, I got a new cert for just retrace.fedoraproject.org and that worked fine.
I did also add a check in nagios for the cert.
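For reference, an expiry check of this kind can be sketched in a few lines of Python. This is not the actual nagios check that was added; the function names `days_until_expiry` and `fetch_not_after` are made up for illustration, and only standard-library calls are used.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days left until `not_after`, given `now` as seconds since the epoch.

    `not_after` uses the text format returned by SSLSocket.getpeercert(),
    for example "Jan 31 00:00:00 2030 GMT".
    """
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the notAfter field of the certificate served by host:port.

    The default context verifies the certificate, so an already expired
    cert makes the handshake raise ssl.SSLCertVerificationError, which is
    itself a useful alarm signal.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline demonstration with fixed, made-up dates (no network access needed).
fixed_now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2030 GMT")
print(days_until_expiry("Jan 31 00:00:00 2030 GMT", fixed_now))
```

Against a live host, one would call something like `days_until_expiry(fetch_not_after("retrace.fedoraproject.org"), time.time())` and alert when the result drops below a chosen threshold.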
I guess if you could try and get a new cert for both retrace.fedoraproject.org and retrace03.rdu-cc.fedoraproject.org and confirm that works, we could close this out. I am not sure how long we have to wait for that limit to be over though.
I tried to request a certificate for both domains just now, but the rate limit is still triggered.
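On the question of how long to wait: the error message describes a limit of 5 certificates for the exact same set of domains within the last 168 hours. Assuming that is a plain sliding window (the real enforcement may differ, see the linked rate-limits page), the earliest retry time can be estimated from the issuance timestamps. The timestamps below are invented for illustration, and `earliest_retry` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=168)  # taken from the error message
LIMIT = 5                      # taken from the error message


def earliest_retry(issued_at: list[datetime], now: datetime) -> datetime:
    """Earliest time a new order fits under the limit, assuming a sliding window."""
    # Only issuances younger than the window count against the limit.
    recent = sorted(t for t in issued_at if now - t < WINDOW)
    if len(recent) < LIMIT:
        return now
    # Enough of the oldest issuances must age out to leave LIMIT - 1 in the window.
    blocking = recent[len(recent) - LIMIT]
    return blocking + WINDOW


# Made-up issuance times: five certificates within the last week.
now = datetime(2022, 1, 10, 12, 0, tzinfo=timezone.utc)
issued = [now - timedelta(hours=h) for h in (1, 2, 3, 4, 100)]
print(earliest_retry(issued, now))
```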
So I just ran the retrace playbook and the copr/certbot role replaced the new certificate with the expired one again (relevant Ansible logs here). After switching back to the new certificate in /etc/httpd/conf.d/retrace_ssl.conf, everything's working, but there's something fishy.
@msuchy Any idea what might have gone wrong? Could it be some corner case in copr/certbot or misconfiguration on our part?
Hmm, that may be related to the latest changes @praiskup has made in roles/copr/certbot. Please run git log -p on this directory: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/copr/certbot
Indeed, it seems I broke something ... though I can't debug on that system :-/
The role though should only ever restore the files when the directory is not present: https://pagure.io/fedora-infra/ansible/blob/349238d2244219cbd315a436cb3c26de9afa35e6/f/roles/copr/certbot/tasks/letsencrypt.yml#_51
I.e. we try to restore (broken certs in this particular case) only when the task "check whether we need to initialize letsencrypt first" detects that there's no letsencrypt data yet. Is this the case?
The workflow is:
    if letsencrypt_initiated:
        do nothing
    elif is there a backup?
        restore from backup
    else:
        initialize the certificates with /bin/certbot (though this requires stopped http server)
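The same decision logic, written as a small runnable Python sketch. This only mirrors the workflow as described in this comment; it is not the role's actual Ansible code, and the names `Action` and `choose_action` are invented for illustration.

```python
from enum import Enum


class Action(Enum):
    NOTHING = "do nothing"
    RESTORE = "restore from backup"
    INITIALIZE = "initialize the certificates with certbot"


def choose_action(letsencrypt_initiated: bool, backup_exists: bool) -> Action:
    """Pick the step the role would take, per the workflow described above."""
    if letsencrypt_initiated:
        # Existing letsencrypt data on the host is never overwritten.
        return Action.NOTHING
    if backup_exists:
        # The backup is restored as-is, even if it holds expired certs.
        return Action.RESTORE
    # No local data and no backup: request fresh certificates.
    return Action.INITIALIZE


for initiated in (True, False):
    for backup in (True, False):
        print(initiated, backup, choose_action(initiated, backup).value)
```

Note that the restore branch does not look at the contents of the backup, so a stale backup would bring back stale certificates whenever the local letsencrypt data is missing.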
This brings an inconvenience: we should make sure the backup is up to date before we start a new machine from scratch (i.e. we should run the playbook against the old VM to sync the backed-up certificates). Should I document this somewhere?
I don't quite get where the LE quota is wasted, though. And also I don't get why the systemctl start certbot-renew.service isn't working.... is the well-known directory provided over port 80? Or is the https server started at all?
I just ran the retrace playbook and the certbot phase went through with no problem. The existing, valid certificate was kept in place.
ok. So, I guess we need to wait and see if it can renew correctly when the time comes?
I guess I'll close this and we can re-open if/when there are renewal problems?
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)