About 4 times now the koschei frontend pods have just started crashing (resulting in one alert for every proxy and one for each openshift node, so about ~30 pages).
logs from the pods:
[Sun Aug 30 00:22:10.758996 2020] [lbmethod_heartbeat:notice] [pid 1:tid 140182298075456] AH02282: No slotmem from mod_heartmonitor
[Sun Aug 30 00:22:10.759698 2020] [http2:warn] [pid 1:tid 140182298075456] AH02951: mod_ssl does not seem to be enabled
[Sun Aug 30 00:22:10.759997 2020] [cgid:error] [pid 8:tid 140182298075456] (13)Permission denied: AH01243: Couldn't bind unix domain socket /etc/httpd/run/cgisock.1
[Sun Aug 30 00:22:10.767070 2020] [mpm_event:notice] [pid 1:tid 140182298075456] AH00489: Apache/2.4.41 (Fedora) mod_wsgi/4.6.6 Python/3.7 configured -- resuming normal operations
[Sun Aug 30 00:22:10.767115 2020] [core:notice] [pid 1:tid 140182298075456] AH00094: Command line: 'httpd -D NO_DETACH -D FOREGROUND -f /usr/share/koschei/httpd.conf'
[Sun Aug 30 00:22:10.767242 2020] [cgid:error] [pid 1:tid 140182298075456] (13)Permission denied: AH01240: Couldn't unlink unix domain socket /etc/httpd/run/cgisock.1
[Sun Aug 30 00:22:10.767282 2020] [cgid:crit] [pid 1:tid 140182298075456] AH01238: cgid daemon failed to initialize
[Sun Aug 30 00:23:08.236387 2020] [core:warn] [pid 1:tid 140182298075456] AH00045: child process 13 still did not exit, sending a SIGTERM
[Sun Aug 30 00:23:08.236427 2020] [core:warn] [pid 1:tid 140182298075456] AH00045: child process 231 still did not exit, sending a SIGTERM
[Sun Aug 30 00:23:10.238493 2020] [core:warn] [pid 1:tid 140182298075456] AH00045: child process 13 still did not exit, sending a SIGTERM
[Sun Aug 30 00:23:10.238577 2020] [core:warn] [pid 1:tid 140182298075456] AH00045: child process 231 still did not exit, sending a SIGTERM
[Sun Aug 30 00:23:12.240639 2020] [core:warn] [pid 1:tid 140182298075456] AH00045: child process 231 still did not exit, sending a SIGTERM
[Sun Aug 30 00:23:13.241857 2020] [mpm_event:notice] [pid 1:tid 140182298075456] AH00491: caught SIGTERM, shutting down
The perms on /run/httpd are weird, but nothing has changed that I can see in months.
We don't build the images for koschei -- it pulls them from elsewhere -- but our openshift says the image hasn't changed.
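For reference, a couple of quick checks from outside the pod (the pod selector, placeholder pod name, and jsonpath fields below are assumptions, not from the ticket):

# Inspect the socket directory permissions inside a running frontend pod
# (deploymentconfig=frontend is the label OpenShift normally adds to DC pods)
oc -n koschei get pods -l deploymentconfig=frontend
oc -n koschei rsh <frontend-pod> ls -ld /run/httpd /etc/httpd/run

# Compare the image configured in the DC with what the pods actually run
oc -n koschei get dc/frontend -o jsonpath='{.spec.template.spec.containers[0].image}'
oc -n koschei get pods -l deploymentconfig=frontend \
    -o jsonpath='{.items[*].status.containerStatuses[*].imageID}'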
I have no idea what's going on here.
@mizdebsk any thoughts?
Metadata Update from @mobrien: - Issue priority set to: Waiting on External (was: Needs Review)
Is this the issue that's causing Koschei to show:
Application is not available
The application is currently not serving requests at this endpoint. It may not have been started or is still starting.
?
Yes :(
I don't know what may be causing this, but I will look into the issue tomorrow.
Metadata Update from @mizdebsk: - Issue assigned to mizdebsk - Issue priority set to: Waiting on Assignee (was: Waiting on External)
Metadata Update from @mizdebsk: - Issue tagged with: ops
I am working on the issue.
It looks like this may be related to poor DB performance. The OpenShift liveness and readiness probes have a timeout of 1 second -- they try to load the /stats page, which runs the following slow query, and the slow response causes the probes to fail.
koschei=# explain analyze SELECT resource_consumption_stats.name AS resource_consumption_stats_name, resource_consumption_stats.arch AS resource_consumption_stats_arch, resource_consumption_stats.time AS resource_consumption_stats_time, resource_consumption_stats.time_percentage AS resource_consumption_stats_time_percentage FROM resource_consumption_stats ORDER BY resource_consumption_stats.time DESC NULLS LAST LIMIT 20 OFFSET 0;
                                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=936146.77..936149.11 rows=20 width=44) (actual time=3550.219..3552.325 rows=20 loops=1)
   ->  Gather Merge  (cost=936146.77..3592774.85 rows=22769508 width=44) (actual time=3550.218..3552.320 rows=20 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=935146.75..963608.63 rows=11384754 width=44) (actual time=3545.225..3545.227 rows=16 loops=3)
               Sort Key: "time" DESC NULLS LAST
               Sort Method: top-N heapsort  Memory: 27kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 27kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 27kB
               ->  Parallel Seq Scan on resource_consumption_stats  (cost=0.00..632202.54 rows=11384754 width=44) (actual time=1077.342..3541.631 rows=31055 loops=3)
 Planning Time: 10.913 ms
 Execution Time: 3552.396 ms
(12 rows)
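As an aside (not part of the original analysis): the planner estimates over 11 million rows per parallel worker while each worker actually returns only about 31 thousand, which suggests the table consisted mostly of dead tuples. A quick way to confirm that kind of bloat is to compare live and dead tuple counts and the last autovacuum run:

-- Hypothetical bloat check, assuming resource_consumption_stats is a regular table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'resource_consumption_stats';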
I've run a full vacuum on the table:
vacuum full resource_consumption_stats;
Before vacuum:
koschei=# select pg_table_size('resource_consumption_stats');
 pg_table_size
---------------
    4247568384
(1 row)
After vacuum:
koschei=# select pg_table_size('resource_consumption_stats');
 pg_table_size
---------------
       7192576
(1 row)
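In human-readable terms that is roughly 4 GB before the vacuum versus about 7 MB after; pg_size_pretty() can print the sizes directly:

SELECT pg_size_pretty(pg_table_size('resource_consumption_stats'));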
Query performance after vacuum:
koschei=# explain analyze SELECT resource_consumption_stats.name AS resource_consumption_stats_name, resource_consumption_stats.arch AS resource_consumption_stats_arch, resource_consumption_stats.time AS resource_consumption_stats_time, resource_consumption_stats.time_percentage AS resource_consumption_stats_time_percentage FROM resource_consumption_stats ORDER BY resource_consumption_stats.time DESC NULLS LAST LIMIT 20 OFFSET 0;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4287.74..4287.79 rows=20 width=44) (actual time=18.061..18.066 rows=20 loops=1)
   ->  Sort  (cost=4287.74..4520.65 rows=93165 width=44) (actual time=18.060..18.061 rows=20 loops=1)
         Sort Key: "time" DESC NULLS LAST
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Seq Scan on resource_consumption_stats  (cost=0.00..1808.65 rows=93165 width=44) (actual time=0.011..9.240 rows=93165 loops=1)
 Planning Time: 0.721 ms
 Execution Time: 18.085 ms
(7 rows)
This seems to have fixed the issue. I've scaled all deployment configs back to their desired replica counts:
oc -n koschei scale --replicas 2 dc/frontend
oc -n koschei scale --replicas 1 dc/polling
oc -n koschei scale --replicas 1 dc/repo-resolver
oc -n koschei scale --replicas 1 dc/build-resolver
oc -n koschei scale --replicas 1 dc/scheduler
oc -n koschei scale --replicas 1 dc/watcher
The issue is resolved for now, but it may reappear in the future unless autovacuum is fixed.
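One possible follow-up (not something done in this ticket) would be to make autovacuum more aggressive for just this table via per-table storage parameters; the thresholds below are illustrative, and if resource_consumption_stats is actually a materialized view the equivalent ALTER MATERIALIZED VIEW ... SET form would be needed instead:

-- Illustrative per-table autovacuum tuning; the values are guesses, not tested settings
ALTER TABLE resource_consumption_stats
    SET (autovacuum_vacuum_scale_factor = 0.05,
         autovacuum_vacuum_threshold = 1000);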
Metadata Update from @mizdebsk: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
In commit 150b9c3 I changed OpenShift probes so that they don't rely on the slow query any longer.
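The actual change lives in commit 150b9c3; for context, a generic way to repoint or relax a probe with the oc CLI looks roughly like this (the URL and timeout are illustrative, not what the commit does):

# Illustrative only: point both probes at a cheap URL and relax the timeout
oc -n koschei set probe dc/frontend --readiness --liveness \
    --get-url=http://:8080/ --timeout-seconds=5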