deltajoka
Contributor

SCSI errors on VMs on NFS datastores

Hi,

We have recently been experiencing problems with our VMs on NFS datastores on an IBM N series/NetApp filer. We have about 5 to 10 short incidents per month where VMs (simultaneously, on multiple ESX hosts) experience some kind of SCSI "lag"/timeout. The following types of messages are logged on the console (and in kern.log) during these glitches:

mptscsih: ioc0: attempting task abort! (sc=de2e0280)

mptbase: ioc0: IOCStatus(0x004b): SCSI IOC Terminated

mptscsih: ioc0: task abort: SUCCESS (sc=de2e0280)

mptbase: ioc0: IOCStatus(0x0002): Busy

mptbase: ioc0: IOCStatus(0x0002): Busy

Usually, some of the ESX hosts (sometimes only one, sometimes several, even though VMs on all hosts are affected) log single occurrences of the following NFS-related errors:

esx03 vmkernel: 118:20:41:44.851 cpu1:1110)WARNING: NFS: 4590: Can't find call with serial number -2146566064

esx04 kernel: nfs_statfs64: statfs error = 5

esx01 kernel: nfs_statfs: statfs error = 5

We have been investigating counters on the switches and on the filer. There seems to be some TCP retransmission occurring, but no dropped packets or packets with bad headers/invalid checksums or similar.
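For anyone who wants to cross-check this from the Linux side (the ESX service console or inside a VM), the kernel's cumulative TCP retransmission counter can be read out of /proc/net/snmp. This is just a sketch; the field order can vary between kernel versions, so it matches the header line by name instead of hard-coding a column:

```shell
# Print the number of TCP segments retransmitted since boot.
# The first "Tcp:" line in /proc/net/snmp names the fields and the
# second carries the values, so pair them up by position.
awk '
/^Tcp:/ && !seen { for (i = 1; i <= NF; i++) name[i] = $i; seen = 1; next }
/^Tcp:/          { for (i = 1; i <= NF; i++) if (name[i] == "RetransSegs") print "RetransSegs:", $i }
' /proc/net/snmp
```

Run it before and after an incident window; a large jump in RetransSegs during a "glitch" would line up with what the switch counters are showing.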

If these problems were the result of high I/O load or latency on the filer, wouldn't the effect be slower transfers rather than VMs simply "losing" their disks for a short period of time?

The ESX hosts are HP DL360 G5, running ESX 3.5u4. The switches are Cisco 2960 (gigabit), with flow control disabled.

Any input on this matter is most welcome!

11 Replies
TobiasKracht
Expert

Looks like you have trouble with disk performance: the host copies files faster than the client can write them.

StarWind Software R&D http://www.starwindsoftware.com
deltajoka
Contributor

We've now enabled flow control according to the NetApp best practices document for VMware Infrastructure 3: flow control "send on/receive off" on the endpoints (the NFS filer and the ESX hosts) and "send off/receive on" on the switches. But that did not resolve the issue. We've now had a few serious problems with VMs ending up with read-only filesystems because of this, so the situation is pretty bad. VMware simply replied in the SR that the problem couldn't be on their side, pointed out the filer as the source of the problem, and claimed there are no options to tweak NFS performance on the ESX hosts (which is not exactly the truth...).
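For anyone reproducing this setup: on a Catalyst 2960 the switch-side setting is applied per port, roughly like below (the interface name is just an example, and as far as I know these switches only let you configure the receive direction, which is the direction that matters for the "send off/receive on" recommendation):

```
interface GigabitEthernet0/1
 flowcontrol receive on
```

The endpoint side ("send on/receive off" on the filer and ESX NICs) is configured on the filer and host NIC settings, not on the switch.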

deltajoka
Contributor

From the last incident (happens a few times a week now):

The problem was visible in the logs of all ESX hosts and all VMs running Debian Stable/Lenny. The time between the first indication in the logs and the last was 46 seconds, so the incident/downtime was at least that long.

By default, the SCSI timeout on Debian Lenny is 30 seconds. On CentOS it is 60 seconds, which is probably why we've only seen this affect our CentOS VMs a few times. I don't know the SCSI timeout setting on Windows, but our Windows VMs have never been affected by this, so I guess it's a bit more than 60 seconds at least.
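For reference, the per-disk timeout the guests are hitting lives in sysfs and can be inspected and raised at runtime. This is a sketch of the generic mechanism, not any particular vendor script; the udev rule file name and the 190-second value are illustrative, and everything below needs root:

```shell
# Show the current SCSI command timeout (in seconds) for the first disk.
cat /sys/block/sda/device/timeout

# Raise it until the next reboot (run as root).
echo 190 > /sys/block/sda/device/timeout

# Make it persistent across reboots with a udev rule (file name is just
# a convention; type "0" restricts the rule to direct-access disks):
cat > /etc/udev/rules.d/99-scsi-timeout.rules <<'EOF'
ACTION=="add", SUBSYSTEM=="scsi", ATTR{type}=="0", ATTR{timeout}="190"
EOF
```

A guest timeout comfortably above the longest NFS outage you see (46+ seconds here) keeps the VM retrying instead of aborting the command and remounting its filesystem read-only.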

Messages from one of the ESX hosts during the incident:

Nov 3 20:43:30 esx-test vmkernel: 6:04:45:07.910 cpu2:4100)WARNING: NFS: 277: Lost connection to server xxx.xxx.xxx.xxx mount point /vol/datastore_xxx, mounted as 86ec7c53-48b328a6-0000-000000000000 ("datastore_xxx")

Nov 3 20:43:34 esx-test vmkernel: 6:04:45:11.918 cpu2:4098)NFSLock: 584: Stop accessing fd 0x410003834350 4

Nov 3 20:43:34 esx-test vmkernel: 6:04:45:11.918 cpu2:4098)NFSLock: 584: Stop accessing fd 0x410003834a50 4

Nov 3 20:43:34 esx-test vmkernel: 6:04:45:11.918 cpu2:4098)NFSLock: 584: Stop accessing fd 0x410003833710 4

Nov 3 20:43:34 esx-test vmkernel: 6:04:45:11.918 cpu2:4098)NFSLock: 584: Stop accessing fd 0x4100038338d0 4

Nov 3 20:43:50 esx-test vmkernel: 6:04:45:28.237 cpu3:4245)NFSLock: 545: Start accessing fd 0x4100038338d0 again

Nov 3 20:43:50 esx-test vmkernel: 6:04:45:28.239 cpu3:4245)NFSLock: 545: Start accessing fd 0x410003833710 again

Nov 3 20:43:50 esx-test vmkernel: 6:04:45:28.295 cpu3:4245)NFSLock: 545: Start accessing fd 0x410003834a50 again

Nov 3 20:43:50 esx-test vmkernel: 6:04:45:28.297 cpu3:4245)NFSLock: 545: Start accessing fd 0x410003834350 again

Nov 3 20:43:51 esx-test vmkernel: 6:04:45:29.506 cpu2:4244)NFS: 286: Restored connection to server xxx.xxx.xxx.xxx mount point /vol/datastore_xxx, mounted as 86ec7c53-48b328a6-0000-000000000000 ("datastore_xxx")

Another ESX host reported this:

Nov 3 20:43:51 esx03 vmkernel: 140:22:56:41.238 cpu2:2940)WARNING: NFS: 4590: Can't find call with serial number -2146566023

On all ESX hosts there are also normal "kernel" messages (not vmkernel) of the type:

esx-test kernel: nfs: server xxx.xxx.xxx.xxx not responding, timed out

esx-test kernel: nfs_statfs: statfs error = 5

esx-test kernel: nfs: server xxx.xxx.xxx.xxx not responding, timed out

esx-test kernel: nfs: server xxx.xxx.xxx.xxx not responding, timed out

esx-test kernel: nfs_statfs: statfs error = 5

One ESX host also reported this:

Nov 3 20:43:40 esx05 vmkernel: 95:11:13:47.039 cpu3:1714)WARNING: Swap: vm 1714: 7515: Swap sync read failed: status=195887167, retrying...

Nov 3 20:43:40 esx05 vmkernel: 95:11:13:47.039 cpu2:1228)WARNING: Swap: vm 1228: 7528: Swap sync read retry failed: status=195887167, retry=1

Nov 3 20:43:40 esx05 vmkernel: 95:11:13:47.039 cpu3:1714)WARNING: Swap: vm 1714: 7528: Swap sync read retry failed: status=195887167, retry=1

Nov 3 20:43:40 esx05 vmkernel: 95:11:13:47.089 cpu2:1228)WARNING: Swap: vm 1228: 7528: Swap sync read retry failed: status=195887167, retry=2

Our VMs have a light load and are normally not swapping, so this is the first time we've seen swap sync errors (and we've had this problem for several months now). We haven't moved our VM swap files to local datastores yet, but since the swap files are barely used and we haven't seen this error before, I think it's safe to assume that having the swap files on the filer rather than on local datastores is NOT part of the root cause of these NFS problems.

The Debian VMs reported these kind of SCSI timeout related messages in their kernel logs:

Nov 3 20:43:45 testvm kernel: [104285.398561] mptscsih: ioc0: attempting task abort! (sc=de1f5d80)

Nov 3 20:43:45 testvm kernel: [104285.398617] sd 0:0:0:0: CDB: Write(10): 2a 00 00 14 04 17 00 00 08 00

Nov 3 20:43:45 testvm kernel: [104285.510003] mptscsih: ioc0: task abort: SUCCESS (sc=de1f5d80)

Nov 3 20:43:45 testvm kernel: [104285.510079] mptscsih: ioc0: attempting task abort! (sc=de1f5180)

Nov 3 20:43:45 testvm kernel: [104285.510136] sd 0:0:0:0: CDB: Write(10): 2a 00 00 28 00 67 00 00 08 00

Nov 3 20:43:45 testvm kernel: [104285.621856] mptscsih: ioc0: task abort: SUCCESS (sc=de1f5180)

Now that we've gathered more types of errors, we have a lot in common with the person in this thread: http://communities.vmware.com/message/1283279

bobross
Hot Shot

This phenomenon is why we stopped running swap on NFS a long time ago - it's just not a good architectural choice. We run swap on Xiotech DAS now and it cooks.

deltajoka
Contributor

Hi,

We are well aware of that. We are planning to place the swap files on local datastores, but it's not an immediate issue since our VMs barely swap today; we will do it anyway to avoid future issues. As I said earlier, though, we've only had swap-related errors once while this NFS issue has been present for several months, so the swap files are most likely not the cause of the problem.

wjs
Contributor

Any luck with this? I am experiencing the exact same issue: my CentOS images see the problem, but not Windows. My hardware is a bit different, however. I assumed in my case that my NFS server is too busy and I need to move some workloads around.

mlubinski
Expert

I have the same issue, but I know where the problem is.

We are using a NetApp FAS3040 with ONTAP 7.2.4, with 10G interfaces on the NetApp. There is a known bug in the 10G driver that was finally fixed in the newest ONTAP release, so we are going to upgrade our ONTAP to the newest version soon.

1. Follow the best practices from NetApp ()

2. Download the attached .sh script on your Red Hat/CentOS servers (for Debian and other OSes the script must be changed), then execute it and reboot each Linux server afterwards.

BTW, the newest VMware Tools in ESX 4 should already contain this script.

After this, when you run cat /sys/block/sda/device/timeout you should get 190 as the output.

This is a real pain, I know exactly. But please don't listen to people saying "yeah, we gave up on NFS and went to DAS or iSCSI": NFS is MUCH better and more flexible. It works like a charm. Want to resize storage? Go ahead, on the fly. That's really beautiful.

I hope that after my upgrade I will no longer see these vmkernel NFSLock warnings, or the I/O errors on the VMs.

If you found this or any other answer useful please consider the use of the Helpful or correct buttons to award points
deltajoka
Contributor

Hi!

Thank you for your input!

Our filer is an IBM N3300 (FAS2020) running ONTAP 7.3.1, but with 1G interfaces, so our setups are a bit different and it's hard to draw any real conclusions from this. We have an open case with IBM regarding this, and I will update them with the info you've provided as well.

Btw, are you saying that the latest VMware Tools in ESX 4 set the SCSI timeout to 190 seconds? That sounds like a workaround rather than a solution :)

mlubinski
Expert

Yes, correct. Indeed, they changed the timeout setting to 180 (AFAIR) in VMware Tools on vSphere 4. Well yes, it's a workaround, but it works :)

The real solution would be to fix this bug (I think it's in the VMware NFS implementation), but there is still no fix available. In my previous post, one guy said that upgrading to the newest version solved this issue (the NFSLocks).

If you found this or any other answer useful please consider the use of the Helpful or correct buttons to award points
fletch00
Enthusiast

We were having the same issues with NFSlock messages, VMs going offline/read-only.

We opened cases with NetApp (ONTAP 7.3.1.1) and VMware (ESX 4 U1); nothing useful came of them.

Then, over the holiday break, we updated VMware Tools everywhere and upgraded the virtual hardware (v4 -> v7), and the issue has not occurred since.

VMware support now says this was the solution all along!

I asked them to point me to the technical document alerting customers to the importance of updating VMware Tools lest they suffer VM data loss, and they said it did not exist. I asked what the procedure was for requesting that this document be created, and they said they would not create it unless they could reproduce the issue!

MAKE SURE YOUR VMWARE TOOLS ARE COMPLETELY UP TO DATE - is what VMware should have told you immediately.

Also check out nfstop to identify which VMs are soaking up your limited supply of IOPS:

http://communities.vmware.com/message/1462324

I ended up migrating a few IO hogs to local disk off the NFS datastores.

VCP5 VSP5 VTSP5 vExpert http://vmadmin.info
mlubinski
Expert

Hi,

Thank you for your reply. The problem in my environment is that I have both ESX 3.5 and 4, but I will do the following to eliminate this issue:

At the end of February we will upgrade ONTAP to version 7.3.2; then I will keep it running as it is for 1-2 weeks and see if the NFSLocks occur. If they do, I will perform a full-environment VMware Tools update/upgrade. I hope it will finally solve this annoying issue.

And this nfstop tool seems perfect; I will check it out.

If you found this or any other answer useful please consider the use of the Helpful or correct buttons to award points