I was notified by @jlebon that the NFS server backing the Jenkins PVC sometimes gets really slow. This slows Jenkins down, which leads to a long queue of jobs, CI panic, overload, and so on.
It fixes itself after a while (as far as I know). We should identify what the issue is and whether it can be resolved.
Here is the log provided by @jlebon:
bash-4.2$ findmnt /var/lib/jenkins
TARGET SOURCE FSTYPE OPTIONS
/var/lib/jenkins nfs02.ci.centos.org:/exports/ocp-prod/pv-10gi-c7401bb0-4053-5307-9e3c-873580f0f23e nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.19.
bash-4.2$ time sh -c 'echo foo > /var/lib/jenkins/zzz'
real 0m17.445s
user 0m0.001s
sys 0m0.002s
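Since the slowdown is intermittent, it may help to repeat the one-off `time` test above on a schedule and keep timestamped results to line up against server and node logs. The following is only a rough sketch, not something currently deployed: the function names `probe_write_latency` and `log_write_latency` and the probe file name are made up for illustration, and it assumes GNU `date` (for `%N`).

```shell
#!/bin/sh
# Hypothetical helper: time one small write into DIR and print the
# elapsed time in milliseconds. Assumes GNU date (supports %N).
probe_write_latency() (
    dir="$1"
    probe_file="$dir/.nfs-latency-probe.$$"
    start_ns=$(date +%s%N)
    echo foo > "$probe_file" || exit 1   # same kind of write as the manual test
    end_ns=$(date +%s%N)
    rm -f "$probe_file"
    echo $(( (end_ns - start_ns) / 1000000 ))
)

# Hypothetical helper: print COUNT timestamped samples, INTERVAL seconds apart.
log_write_latency() (
    dir="$1"; count="${2:-10}"; interval="${3:-60}"
    i=0
    while [ "$i" -lt "$count" ]; do
        ms=$(probe_write_latency "$dir") || ms="error"
        printf '%s %s write_ms=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$dir" "$ms"
        i=$(( i + 1 ))
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    done
)
```

For example, `log_write_latency /var/lib/jenkins 60 30` would take one sample every 30 seconds for roughly half an hour. Only the write itself is timed. Removing the probe file happens after the second timestamp.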
Metadata Update from @dkirwan: - Issue tagged with: centos-ci-infra, medium-gain, medium-trouble, need-more-info
Metadata Update from @dkirwan: - Issue untagged with: medium-gain, medium-trouble
Metadata Update from @dkirwan: - Issue untagged with: need-more-info - Issue priority set to: Waiting on Assignee - Issue tagged with: groomed, medium-gain, medium-trouble
I spoke with @arrfab, who mentioned that the NFS server itself is a Seamicro board with simple SATA drives in an md software RAID.
The OpenShift nodes themselves have only a single 1Gbit network interface, used for all traffic, so multiple containers on the same node accessing storage at once might exacerbate the slowness.
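One way to check whether the single 1Gbit link is actually saturated during a slow period is to sample the interface byte counters on a node. This is a sketch under assumptions: the helper names `rate_mbit` and `sample_link_usage` are hypothetical, the interface name has to be supplied by whoever runs it, and it relies on the Linux sysfs counters under `/sys/class/net/<iface>/statistics`.

```shell
#!/bin/sh
# Hypothetical helper: convert two byte-counter readings taken SECONDS
# apart into whole Mbit/s (1 Gbit/s is about 125000000 bytes per second).
rate_mbit() (
    before="$1"; after="$2"; seconds="$3"
    echo $(( ($after - $before) * 8 / $seconds / 1000000 ))
)

# Hypothetical helper: sample rx/tx byte counters for IFACE from sysfs
# (Linux only) and print the average rate over the sampling window.
sample_link_usage() (
    iface="$1"; seconds="${2:-5}"
    stats="/sys/class/net/$iface/statistics"
    rx1=$(cat "$stats/rx_bytes") || exit 1
    tx1=$(cat "$stats/tx_bytes") || exit 1
    sleep "$seconds"
    rx2=$(cat "$stats/rx_bytes") || exit 1
    tx2=$(cat "$stats/tx_bytes") || exit 1
    echo "$iface rx_mbit=$(rate_mbit "$rx1" "$rx2" "$seconds") tx_mbit=$(rate_mbit "$tx1" "$tx2" "$seconds")"
)
```

Values close to 1000 in either direction while writes are slow would point at the node's link. Low values would point back at the NFS server or its disks.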
Thanks for filing this @siddharthvipul1!
FWIW, the time when I noticed this happen was Aug 2 10:40 AM EST (in case it could help with looking at NFS/node journal logs).
Metadata Update from @dkirwan: - Issue assigned to dkirwan
<img alt="pods_active.png" src="/centos-infra/issue/raw/files/31abc5b62a91ba6e1389adc01ec3744a2db8f171e5c5c49c3e16c382bf0efe08-pods_active.png" />
There were a number of pods running around the time you reported problems, @jlebon. They have long since been cleaned up, and I'm not sure which nodes they ran on during this period.
I'll have to do some further investigation into the journal logs on the nodes.
Metadata Update from @dkirwan: - Issue tagged with: groomed
Metadata Update from @dkirwan: - Issue priority set to: None (was: Waiting on Assignee)
Metadata Update from @dkirwan: - Issue untagged with: groomed
This is an issue again and I think it's preventing Jenkins from standing up in the fedora-coreos project.
bash-4.2$ findmnt /var/lib/jenkins
TARGET SOURCE FSTYPE OPTIONS
/var/lib/jenkins nfs02.ci.centos.org:/exports/ocp-prod/pv-5gi-4b038b71-3f49-579b-ae04-a147ddcc1140 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.19.0.137,local_lock=none,addr=172.19.0.22
bash-4.2$ time sh -c 'echo foobar > zzz'
real 0m12.413s
user 0m0.001s
sys 0m0.002s
Metadata Update from @arrfab: - Issue marked as depending on: #53
Linked to #53, where this is all discussed. There is no need to copy/paste comments into every impacted issue/ticket. Linking to the main one lets people subscribe and follow along there.
Should be resolved now.
Metadata Update from @dkirwan: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)