From time to time recently, one or both of our ipsilon idp servers have gone unresponsive.
When this happens haproxy should remove the unresponsive node, but it does not seem to do so and users get a 'gateway timeout' or the like on some requests (but not all).
The next time this happens, we should look at the haproxy health check and see why it's not marking the machine as down.
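One way to see what haproxy thinks of each node is to read the CSV that `show stat` returns on its stats socket, for example via `echo "show stat" | socat stdio <socket path>` (the socket path depends on the local haproxy config). Below is a minimal sketch that pulls the `status` and `check_status` columns for each server out of that CSV. The backend and server names in the sample are made up for illustration, and the sample keeps only four of the many columns haproxy really emits.

```python
import csv
import io


def server_check_states(show_stat_csv):
    """Map (proxy, server) -> (status, check_status) from `show stat` CSV text."""
    text = show_stat_csv.lstrip()
    # haproxy prefixes the CSV header line with "# "; drop it so the
    # column names come out clean.
    if text.startswith("# "):
        text = text[2:]
    states = {}
    for row in csv.DictReader(io.StringIO(text)):
        # FRONTEND/BACKEND rows are aggregates, not individual servers.
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue
        key = (row["pxname"], row["svname"])
        states[key] = (row["status"], row.get("check_status", ""))
    return states


# Hypothetical, heavily truncated sample; real output has many more columns.
SAMPLE = """# pxname,svname,status,check_status
ipsilon-backend,ipsilon01,UP,L7OK
ipsilon-backend,ipsilon02,UP,L7OK
ipsilon-backend,BACKEND,UP,
"""

for (proxy, server), (status, check) in sorted(server_check_states(SAMPLE).items()):
    print(f"{proxy}/{server}: status={status} check={check}")
```

If a node that users cannot reach still shows `UP` with a passing `check_status`, the check itself is too shallow rather than misconfigured.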
We may also want to upgrade the nodes to fedora 39 to see if that helps this underlying issue.
Metadata Update from @phsmoura: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: medium-gain, medium-trouble, ops
Copying over the impact details from https://pagure.io/fedora-infrastructure/issue/11830:
As the end-user, I wasn't getting a gateway timeout. Authentication attempts via id.fedoraproject.org are returning as invalid credentials. While this is happening, I'm still able to login to accounts.fedoraproject.org and get a fresh kerberos ticket for the FEDORAPROJECT.ORG realm.
Yeah, there seem to be several failure modes... that might not be related.
The instance stops answering requests at all, except the haproxy health checks (those work). So, people get a gateway timeout.
sssd stops being able to contact the ipa servers, so requests to auth get a 'System error' return and the user gets a 'permission denied'.
Kerberos tickets/requests go direct to the ipa servers with gssproxy, so they don't hit the ipsilon / idp servers at all. It's only OIDC, openid, SAML2 requests that hit the ipsilon servers...
Since we are in beta freeze right now, I don't really want to do too much, but as soon as we are out I'd like to upgrade them and see if we can gather any more information... also, we should try and figure out a better health endpoint for haproxy so it could at least mark them down instead of sending them queries.
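As a rough idea of what a better health endpoint could look like: a small HTTP service on each node that runs a few deeper probes and answers 503 if any of them fails, so haproxy (for instance with `option httpchk` pointed at it) would take the node out of rotation. This is only a sketch, not how ipsilon or our haproxy is set up today; the probe names, hosts and ports are all hypothetical.

```python
import socket
from http.server import BaseHTTPRequestHandler


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def evaluate(probes):
    """Run every named probe and return (http_status, text_body).

    A probe is a zero-argument callable; a falsy result or an exception
    counts as a failure. Any failure turns the overall answer into 503.
    """
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    body = "\n".join(
        f"{name}: {'ok' if ok else 'FAIL'}" for name, ok in sorted(results.items())
    )
    return status, body


def make_handler(probes):
    """Build a request handler class that reports the probe results on GET."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = evaluate(probes)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            # Keep health-check polling out of the logs.
            pass

    return HealthHandler
```

To try it, pass something like `{"ipa": lambda: tcp_probe("ipa01.example.org", 389)}` to `make_handler` and serve the result with `http.server.HTTPServer` on a local port. The useful probes are the ones that mirror the failures above: whether the local ipsilon actually answers a real request, and whether the node can still reach the ipa servers.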
Noting that this is happening again -- reported initially at ~20:00 UTC on 2024-03-23 in Infra matrix channel.
I went to report it shortly before this comment (~21:45 UTC) as I am also having trouble logging into services via FAS.
Very weird issue. I have not run into this in my own deployments of ipsilon, but will ask around to see if anyone has some ideas.
02 was unresponsive, 01 needed sssd restarted.
everything should be back.
ok. We are out of freeze now.
So I reinstalled them with f39. They seem to be operating nicely now.
So, let's see if they have any issues now for a bit... if they do we can debug from there.
I'll keep this open for this week at least to see if we see any problems. Please add comments if you see gateway timeouts or auth failures when they shouldn't be happening.
Metadata Update from @kevin: - Issue assigned to kevin
ok, it's been a week. No problems so far.
Let's close this, and if it happens again, please re-open this or file a new issue and we can debug more.
Metadata Update from @kevin: - Issue close_status updated to: Fixed with Explanation - Issue status updated to: Closed (was: Open)