Issue #26: NFS server slow on New Cluster - centos-infra

Closed: Fixed 3 years ago by dkirwan. Opened 3 years ago by siddharthvipul1.

I was notified by @jlebon that the NFS server backing the Jenkins PVC sometimes gets really slow. This slows Jenkins down, which leads to a long queue of jobs and, in turn, to CI panic and overload.

It fixes itself after a while (from what I know), but we should identify what the issue is and whether it can be resolved.

Here is the log that was provided by @jlebon:

bash-4.2$ findmnt /var/lib/jenkins
TARGET           SOURCE                                                                             FSTYPE OPTIONS
/var/lib/jenkins nfs02.ci.centos.org:/exports/ocp-prod/pv-10gi-c7401bb0-4053-5307-9e3c-873580f0f23e nfs4   rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.19.
bash-4.2$ time sh -c 'echo foo > /var/lib/jenkins/zzz'

real    0m17.445s
user    0m0.001s
sys     0m0.002s
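
A single timed write can be noisy, so one way to follow up on the measurement above is to repeat it a few times and log each latency. The sketch below is only an illustration: the function name `probe_write_latency` and the probe file name are hypothetical, it assumes GNU `date` (for `%N`) and GNU `dd` (for `status=none`), and unlike the original `echo foo > ...` test it forces each write out with `conv=fsync`.

```shell
# Hypothetical helper: time several small synced writes in a directory.
# Usage: probe_write_latency DIR [COUNT]
probe_write_latency() (
    dir="$1"
    count="${2:-5}"
    probe="$dir/.nfs-latency-probe.$$"   # scratch file, removed at the end
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s%N)              # nanoseconds since the epoch (GNU date)
        # conv=fsync makes dd flush the data before it returns
        echo foo | dd of="$probe" conv=fsync status=none
        end=$(date +%s%N)
        echo "write $i: $(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
    done
    rm -f "$probe"
)
```

Running `probe_write_latency /var/lib/jenkins 10` inside the affected pod would print one latency line per write, which makes it easier to tell a one-off stall from a sustained slowdown.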

Metadata Update from @dkirwan:
- Issue tagged with: centos-ci-infra, medium-gain, medium-trouble, need-more-info

3 years ago

Metadata Update from @dkirwan:
- Issue untagged with: medium-gain, medium-trouble

3 years ago

Metadata Update from @dkirwan:
- Issue untagged with: need-more-info
- Issue priority set to: Waiting on Assignee
- Issue tagged with: groomed, medium-gain, medium-trouble

3 years ago

dkirwan commented 3 years ago

I was speaking with @arrfab, who mentioned that the NFS server itself is a SeaMicro board with plain SATA drives in a software RAID (md) array.

The OpenShift nodes themselves have only a single 1 Gbit network interface, which is used for all traffic, so multiple containers on the same node accessing storage at the same time might exacerbate the slowness problem.
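
To put the shared-interface concern in rough numbers: a 1 Gbit/s link carries at most about 125 MB/s before any protocol overhead, and that ceiling is split between everything on the node. The helper below is a back-of-the-envelope sketch, not a measurement; the name `per_client_mbyte` is hypothetical, and it assumes an even split between clients and ignores TCP/NFS overhead and all non-storage traffic.

```shell
# Hypothetical helper: theoretical per-client share of a link, in MB/s.
# Usage: per_client_mbyte LINK_MBIT CLIENTS
per_client_mbyte() (
    link_mbit="$1"   # link speed in Mbit/s, e.g. 1000 for 1 Gbit
    clients="$2"     # number of containers sharing the link
    # 8 bits per byte; integer division, so the result is rounded down
    echo $(( link_mbit / 8 / clients ))
)
```

For example, `per_client_mbyte 1000 10` gives an upper bound of 12 MB/s per container when ten containers on one node use the link at once.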

Edited 3 years ago by dkirwan

jlebon commented 3 years ago

Thanks for filing this @siddharthvipul1!

FWIW, I noticed this happening on Aug 2 at 10:40 AM EST (in case that helps when looking at the NFS/node journal logs).

Metadata Update from @dkirwan:
- Issue assigned to dkirwan

3 years ago

dkirwan commented 3 years ago

A number of pods were running around the time you reported the problem, @jlebon, but they have long since been cleaned up, and I'm not sure which nodes they ran on during that period.

I'll have to do some further investigation into the journal logs on the nodes.
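
One possible way to narrow the journal search is to build a `journalctl` time window around the reported timestamp. This is a sketch under assumptions: the function name `journal_window` is hypothetical, it relies on GNU `date -d`, and it only prints the command rather than running it. `--since` and `--until` are standard `journalctl` options; the timestamp has to be given in the node's local timezone.

```shell
# Hypothetical helper: print a journalctl command covering +/- MINUTES
# around a timestamp. Usage: journal_window "YYYY-MM-DD HH:MM" [MINUTES]
journal_window() (
    when="$1"
    minutes="${2:-15}"
    ts=$(date -d "$when" +%s) || exit 1   # timestamp as epoch seconds
    fmt='+%Y-%m-%d %H:%M:%S'
    start=$(date -d "@$(( ts - minutes * 60 ))" "$fmt")
    end=$(date -d "@$(( ts + minutes * 60 ))" "$fmt")
    printf 'journalctl --since "%s" --until "%s"\n' "$start" "$end"
)
```

For instance, `journal_window "2021-01-01 10:40" 15` (a placeholder date, not the date of the incident) prints a command covering 10:25 to 10:55, which can then be run on each node and on the NFS server.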

Metadata Update from @dkirwan:
- Issue tagged with: groomed

3 years ago

Metadata Update from @dkirwan:
- Issue priority set to: None (was: Waiting on Assignee)

3 years ago

Metadata Update from @dkirwan:
- Issue untagged with: groomed

3 years ago

jlebon commented 3 years ago

This is an issue again and I think it's preventing Jenkins from standing up in the fedora-coreos project.

bash-4.2$ findmnt /var/lib/jenkins
TARGET           SOURCE                                                                            FSTYPE OPTIONS
/var/lib/jenkins nfs02.ci.centos.org:/exports/ocp-prod/pv-5gi-4b038b71-3f49-579b-ae04-a147ddcc1140 nfs4   rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.19.0.137,local_lock=none,addr=172.19.0.22
bash-4.2$ time sh -c 'echo foobar > zzz'

real    0m12.413s
user    0m0.001s
sys     0m0.002s

Metadata Update from @arrfab:
- Issue marked as depending on: #53

3 years ago

arrfab commented 3 years ago

Linked to #53, where it's all discussed. There's no need to copy/paste the comment into every impacted issue/ticket; linking to the main one means people can subscribe and follow it there.

dkirwan commented 3 years ago

Should be resolved now.

Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

Metadata

Assignee

dkirwan

Tags

centos-ci-infra medium-gain medium-trouble

Blocking

None

Depending on

#53

Extend storage space on storage02 node with new HDD

Priority

None

Boards 1

CentOS CI Infra Status: Done

Attachments 1

pods_active.png

Attached 3 years ago