PR#399: Increase gunicorn worker default timeout. - greenwave

#399 Increase gunicorn worker default timeout.

Closed 5 years ago by cverna. Opened 5 years ago by cverna.

cverna/greenwave increase_gunicorn_timeout into master

Increase gunicorn worker default timeout.

Clement Verna • 5 years ago

46a0402

Dockerfile

file modified

+1 -1

    @@ -35,4 +35,4 @@
    RUN rm -rf ./fedmsg.d
    USER 1001
    EXPOSE 8080
    - ENTRYPOINT docker/install-ca.sh && gunicorn-3 --workers 8 --bind 0.0.0.0:8080 --access-logfile=- --enable-stdio-inheritance greenwave.wsgi:app
    + ENTRYPOINT docker/install-ca.sh && gunicorn-3 --workers 8 --timeout 330 --graceful-timeout 300 --bind 0.0.0.0:8080 --access-logfile=- --enable-stdio-inheritance greenwave.wsgi:app

cverna commented 5 years ago

By default, a gunicorn worker is killed and restarted if it stays
silent for 30s. In greenwave's case, a worker may have to wait
longer than that to fetch results from ResultsDB.

This commit increases the timeout to 330s and the graceful
timeout to 300s. These values are temporary and will need to
be fine-tuned later.

Signed-off-by: Clement Verna <cverna@tutanota.com>
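
For reference, the same settings could also live in a gunicorn configuration file, which is plain Python. This is only a sketch: the file name `gunicorn.conf.py` is an assumption, and the image in this PR passes the values as command-line flags instead. The setting names mirror the flags on the ENTRYPOINT line of the diff.

```python
# gunicorn.conf.py (hypothetical file name; the image in this PR uses CLI flags)
# Mirrors the flags on the ENTRYPOINT line of the diff.

bind = "0.0.0.0:8080"
workers = 8

# A worker that stays silent for longer than this many seconds is killed
# and restarted. The gunicorn default is 30.
timeout = 330

# After receiving a restart signal, workers get this many seconds to finish
# serving in-flight requests before they are force killed.
graceful_timeout = 300

accesslog = "-"  # "-" sends the access log to stdout
enable_stdio_inheritance = True
```

Assuming that file name, it would be loaded with `gunicorn-3 -c gunicorn.conf.py greenwave.wsgi:app`.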

lucarval commented 5 years ago

@cverna, that seems a bit long. Is there a specific issue/outage you're trying to address?

lholecek commented 5 years ago

Can you instead specify those additional parameters in the OpenShift template which uses the container image?

cverna commented 5 years ago

> @cverna, that seems a bit long. Is there a specific issue/outage you're trying to address?

So in Fedora's instance we were seeing a lot of these (105 times within an hour):

[2019-03-15 16:06:23 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:56)
[2019-03-15 16:06:23 +0000] [56] [INFO] Worker exiting (pid: 56)
[2019-03-15 16:06:23 +0000] [60] [INFO] Booting worker with pid: 60

We have deployed a fix to test this configuration. We are seeing far fewer errors, but there are still workers that wait for more than 5 minutes doing nothing (10 times in 1 hour) :(
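
A count like "105 times within an hour" can be reproduced from the gunicorn log with a small script. The following is a minimal sketch, assuming the log lines have exactly the format shown above; the function name `count_worker_timeouts` is made up for illustration.

```python
import re
from collections import Counter

# Matches the gunicorn master's timeout line, e.g.
# [2019-03-15 16:06:23 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:56)
TIMEOUT_RE = re.compile(
    r"^\[(?P<hour>\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2} [+-]\d{4}\] "
    r"\[\d+\] \[CRITICAL\] WORKER TIMEOUT \(pid:(?P<pid>\d+)\)"
)


def count_worker_timeouts(lines):
    """Map each hour ('YYYY-MM-DD HH') to its number of WORKER TIMEOUT lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            counts[match.group("hour")] += 1
    return counts


# The three log lines quoted in this comment: only the first is a timeout.
sample = [
    "[2019-03-15 16:06:23 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:56)",
    "[2019-03-15 16:06:23 +0000] [56] [INFO] Worker exiting (pid: 56)",
    "[2019-03-15 16:06:23 +0000] [60] [INFO] Booting worker with pid: 60",
]
print(count_worker_timeouts(sample))
```

Running the same function over an hour of logs before and after a configuration change gives a quick way to compare error rates.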

cverna commented 5 years ago

Also note that OpenShift uses the default haproxy configuration, which terminates requests after 30s.

We had to run the following on the greenwave route, and that reduced the number of 504s returned by greenwave.

oc -n greenwave annotate route greenwave-web --overwrite haproxy.router.openshift.io/timeout=330s
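
The point here is that the shortest timeout in the chain wins: if the route gives up before the gunicorn worker does, the client gets a 504 no matter how long the worker is allowed to run. A rough sketch of that check follows; the helper names are hypothetical, and the parser only handles the `s` and `m` suffixes.

```python
def parse_seconds(value):
    """Parse a duration such as '330s' or '5m' into seconds (only s and m here)."""
    units = {"s": 1, "m": 60}
    number, unit = value[:-1], value[-1:]
    if unit not in units or not number.isdigit():
        raise ValueError(f"unsupported duration: {value!r}")
    return int(number) * units[unit]


def proxy_outlasts_worker(route_timeout, worker_timeout_seconds):
    """True if the route waits at least as long as the gunicorn worker timeout."""
    return parse_seconds(route_timeout) >= worker_timeout_seconds


# Values from this PR: route annotation 330s, gunicorn --timeout 330.
print(proxy_outlasts_worker("330s", 330))  # True
# With the 30s route default mentioned above, the proxy gives up first.
print(proxy_outlasts_worker("30s", 330))   # False
```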

cverna commented 5 years ago

> Can you instead specify those additional parameters in the OpenShift template which uses the container image?

Yes, this is what we ended up doing: https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=f6784eb283dde2aee057da0c2772b61c243b7fa5

Do you want me to close this PR?

gnaponie commented 5 years ago

We saw those logs in the internal instance as well. I ran some tests with the timeout completely disabled, and in some cases I had to wait even longer than 5 minutes. Increasing the timeout is, in my opinion, not going to solve the problem, and even if it did, it's not the solution I would hope for.
For our objectives, Greenwave should return in a more reasonable time (seconds).

Internally, the change to address this issue was already deployed in stage, and we no longer saw these kinds of logs and problems.

So I would prefer not to merge this change. But let's hear what other people say.

Edited 5 years ago by gnaponie

cverna commented 5 years ago

> For our objectives, Greenwave should return in a more reasonable time (seconds).
> Internally, the change to address this issue was already deployed in stage, and we no longer saw these kinds of logs and problems.

What was the change?

gnaponie commented 5 years ago

> > For our objectives, Greenwave should return in a more reasonable time (seconds).
> > Internally, the change to address this issue was already deployed in stage, and we no longer saw these kinds of logs and problems.
>
> What was the change?

https://pagure.io/greenwave/pull-request/378#request_diff

cverna commented 5 years ago

> > > For our objectives, Greenwave should return in a more reasonable time (seconds).
> > > Internally, the change to address this issue was already deployed in stage, and we no longer saw these kinds of logs and problems.
> >
> > What was the change?
>
> https://pagure.io/greenwave/pull-request/378#request_diff

Thanks :)

cverna commented 5 years ago

I'm going to close this, since we should not need it with the incoming change.

Pull-Request has been closed by cverna

5 years ago