#10427 Weird liveness and readiness probe failures on stg / oraculum
Closed: Fixed 3 years ago by frantisekz. Opened 3 years ago by frantisekz.

Describe what you would like us to do:


I am seeing timeouts and pod kills on the staging cluster in the oraculum project (oraculum-api-endpoint). I was testing a trivial change in the oraculum codebase. The change itself works (I can curl the tested URL from inside the pod just fine), but the deployment is being stopped due to probe timeouts.

The app uses s2i / gunicorn for deployment. The older image was deployed roughly 2 months ago and worked just fine. The app lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/oraculum.yml and https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openshift-apps/oraculum .

Were there any changes in how the probes work? Can somebody point me to where I should look to solve this?

Thanks!

When do you need this to be done by? (YYYY/MM/DD)



The issues listed in events are:

Readiness probe failed: Get http://10.131.3.214:8080/: dial tcp 10.131.3.214:8080: connect: connection refused
Liveness probe failed: Get http://10.131.3.214:8080/: dial tcp 10.131.3.214:8080: connect: connection refused
Search Line limits were exceeded, some search paths have been omitted, the applied search line is: oraculum.svc.cluster.local svc.cluster.local cluster.local iad2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org

OpenShift then rolls back to the previous build, which passes the probe checks. There weren't any DNS or playbook changes, and the app itself works just fine in the new build, as mentioned above.
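The "connection refused" in the events means nothing was accepting TCP connections on the pod IP and port the probe targets. A minimal sketch for checking that by hand from inside the pod, assuming Python is available there; `tcp_probe` is a hypothetical helper name, not part of oraculum or OpenShift:

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) when nothing is listening on that address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `tcp_probe("127.0.0.1", 8080)` and `tcp_probe("10.131.3.214", 8080)` (the pod IP from the events, which changes with every new pod) shows whether the app is reachable only on loopback or also on the address the probes use.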

This might have something to do with recent log4j mitigations. @dkirwan might know more?

There shouldn't be any functional changes on OpenShift 3.11 anymore.
I suspect your pod was deployed on a slow node, and the liveness probe triggered before the app was ready to accept connections.
You can try raising your initialDelaySeconds and/or failureThreshold a bit to see if that solves your issue.

So, I don't know what changed, but the default --bind to 0.0.0.0:8080 was no longer being added to the gunicorn command line. Solved by https://pagure.io/fedora-qa/oraculum/c/eea63c73be59a007ed559503190bc1f7cafe0556?branch=master .
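For context: without an explicit bind (and with no PORT environment variable set), gunicorn listens on 127.0.0.1:8000, so probes against the pod IP on port 8080 get connection refused. One way to pin the bind address explicitly is a gunicorn config file. This is only a sketch; the file name and placement are assumptions, and the linked commit may solve it differently:

```python
# gunicorn.conf.py (hypothetical file, not necessarily what the linked commit does)

# Listen on all interfaces on port 8080, the port the probes in the
# events target, instead of gunicorn's built-in default of 127.0.0.1:8000.
bind = "0.0.0.0:8080"
```

The file can be passed with `gunicorn -c gunicorn.conf.py <module>:<app>`, where the module and app names depend on the project.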

Metadata Update from @frantisekz:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago
