#26 NFS server slow on New Cluster
Closed: Fixed 3 years ago by dkirwan. Opened 3 years ago by siddharthvipul1.

@jlebon notified me that the NFS server backing the Jenkins PVC sometimes becomes very slow. This slows Jenkins down, which leads to a long queue of jobs, CI panic, overload, and so on.

It fixes itself after a while (from what I know). We should identify what the issue is and whether it can be resolved.

Here is the log that was provided by @jlebon:

bash-4.2$ findmnt /var/lib/jenkins
TARGET           SOURCE                                                                             FSTYPE OPTIONS
/var/lib/jenkins nfs02.ci.centos.org:/exports/ocp-prod/pv-10gi-c7401bb0-4053-5307-9e3c-873580f0f23e nfs4   rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.19.
bash-4.2$ time sh -c 'echo foo > /var/lib/jenkins/zzz'

real    0m17.445s
user    0m0.001s
sys     0m0.002s
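
To see how often the stall happens rather than catching it by hand, the one-off timing above could be repeated in a loop. This is only a minimal sketch, assuming a POSIX shell and a `date` that supports `%s`. The function name `probe_write_latency`, the probe file name `zzz-probe`, and the default threshold of 5 seconds are made up for illustration and are not something used in this ticket.

```shell
# Hypothetical helper: time `count` small writes into `dir`.
# Prints "<epoch_start> <elapsed_seconds>" per write, then a summary line
# "slow_writes=<n>" counting writes that took `threshold` seconds or more.
probe_write_latency() {
    dir=$1
    count=${2:-5}
    threshold=${3:-5}
    i=0
    slow=0
    while [ "$i" -lt "$count" ]; do
        start=$(date +%s)
        # Same kind of tiny write as the manual test in the log
        echo foo > "$dir/zzz-probe"
        end=$(date +%s)
        elapsed=$((end - start))
        echo "$start $elapsed"
        if [ "$elapsed" -ge "$threshold" ]; then
            slow=$((slow + 1))
        fi
        i=$((i + 1))
    done
    # Clean up the probe file so nothing is left behind on the volume
    rm -f "$dir/zzz-probe"
    echo "slow_writes=$slow"
}
```

For example, `probe_write_latency /var/lib/jenkins 10 5` would issue ten writes and count those that took 5 s or longer. Whole-second resolution is coarse, but the stall shown above took about 17 s, so it should be enough to flag the bad periods.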

Metadata Update from @dkirwan:
- Issue tagged with: centos-ci-infra, medium-gain, medium-trouble, need-more-info

3 years ago

Metadata Update from @dkirwan:
- Issue untagged with: medium-gain, medium-trouble

3 years ago

Metadata Update from @dkirwan:
- Issue untagged with: need-more-info
- Issue priority set to: Waiting on Assignee
- Issue tagged with: groomed, medium-gain, medium-trouble

3 years ago

I was speaking with @arrfab, who mentioned that the NFS server itself is a SeaMicro board with plain SATA drives in a software RAID (md device).

The OpenShift nodes themselves have only a single 1 Gbit network interface, which is used for all traffic, so multiple containers on the same node accessing storage at once might exacerbate the slowness problem.
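
As a rough illustration of the shared-link point: 1 Gbit/s is at most 125 MB/s before any protocol overhead, and that ceiling is split between everything on the node. The sketch below just does that division. The function name `fair_share_mb_s` and the assumption of an equal split between clients are mine, and real traffic will not divide this evenly.

```shell
# Hypothetical helper: upper bound on per-client throughput in MB/s when
# `clients` share one link of `link_mbit` Mbit/s equally.
# Integer arithmetic, so results are rounded down; overhead is ignored.
fair_share_mb_s() {
    link_mbit=$1
    clients=$2
    echo $(( link_mbit / 8 / clients ))
}
```

With these assumptions, `fair_share_mb_s 1000 1` prints 125 and `fair_share_mb_s 1000 5` prints 25, so five busy containers on one node would each see at most about 25 MB/s to storage.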

Thanks for filing this @siddharthvipul1!

FWIW, the time when I noticed this happen was Aug 2 10:40 AM EST (in case it could help with looking at NFS/node journal logs).

Metadata Update from @dkirwan:
- Issue assigned to dkirwan

3 years ago

pods_active.png

There were a number of pods running around the time you reported problems, @jlebon. They have long since been cleaned up, and I'm not sure which nodes they ran on during this period.

I'll have to do some further investigation into the journal logs on the nodes.
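
One way to narrow down the journal search is to look only for the kernel's NFS client stall messages ("nfs: server ... not responding" and the matching "nfs: server ... OK"). A minimal sketch follows. The function name `nfs_stalls` is made up, and it assumes the node kernel logged these messages at all, which depends on the stall outlasting the mount's timeout and retry settings.

```shell
# Hypothetical filter: read log lines on stdin and keep only the kernel NFS
# client messages that mark the start ("not responding") and end ("OK")
# of a server stall.
nfs_stalls() {
    grep -E 'nfs: server .* (not responding|OK)'
}
```

It could be fed from the kernel journal for a window around the reported time, for example `journalctl -k --since "<start>" --until "<end>" | nfs_stalls`, where `<start>` and `<end>` are placeholders for timestamps around Aug 2 10:40 AM.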

Metadata Update from @dkirwan:
- Issue tagged with: groomed

3 years ago

Metadata Update from @dkirwan:
- Issue priority set to: None (was: Waiting on Assignee)

3 years ago

Metadata Update from @dkirwan:
- Issue untagged with: groomed

3 years ago

This is an issue again and I think it's preventing Jenkins from standing up in the fedora-coreos project.

bash-4.2$ findmnt /var/lib/jenkins
TARGET           SOURCE                                                                            FSTYPE OPTIONS
/var/lib/jenkins nfs02.ci.centos.org:/exports/ocp-prod/pv-5gi-4b038b71-3f49-579b-ae04-a147ddcc1140 nfs4   rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.19.0.137,local_lock=none,addr=172.19.0.22
bash-4.2$ time sh -c 'echo foobar > zzz'

real    0m12.413s
user    0m0.001s
sys     0m0.002s

Metadata Update from @arrfab:
- Issue marked as depending on: #53

3 years ago

Linked to #53, where it's all discussed. There is no need to copy/paste the comment into every impacted issue/ticket; linking to the main one lets people subscribe and follow it there.

Should be resolved now.

Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

