Hi,
We have recently been experiencing problems with our VMs on NFS datastores on an IBM N series/NetApp filer. We see about 5 to 10 short incidents per month in which VMs (simultaneously, on multiple ESX hosts) experience some kind of SCSI "lag"/timeout. The following types of messages are logged on the console (and in kern.log) during these "glitches":
mptscsih: ioc0: attempting task abort! (sc=de2e0280)
mptbase: ioc0: IOCStatus(0x004b): SCSI IOC Terminated
mptscsih: ioc0: task abort: SUCCESS (sc=de2e0280)
mptbase: ioc0: IOCStatus(0x0002): Busy
mptbase: ioc0: IOCStatus(0x0002): Busy
Usually, some of the ESX hosts (sometimes only one, sometimes several, even though VMs on all hosts are affected) log single occurrences of the following types of NFS-related errors:
esx03 vmkernel: 118:20:41:44.851 cpu1:1110)WARNING: NFS: 4590: Can't find call with serial number -2146566064
esx04 kernel: nfs_statfs64: statfs error = 5
esx01 kernel: nfs_statfs: statfs error = 5
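To line up the incidents across hosts, something like the following could be used to tally these messages from collected log files. This is only a rough sketch: the regular expressions are derived from the lines quoted above, while the category names (task_abort, ioc_busy and so on) and the helper functions are our own hypothetical labels, not anything ESX or the filer provides.

```python
import re
from collections import Counter

# Patterns built from the messages quoted above.
# The category names are our own labels (assumptions for illustration).
PATTERNS = {
    "task_abort": re.compile(r"mptscsih: ioc\d+: attempting task abort"),
    "ioc_busy": re.compile(r"mptbase: ioc\d+: IOCStatus\(0x0002\): Busy"),
    "nfs_lost_call": re.compile(r"NFS: \d+: Can't find call with serial number"),
    "nfs_statfs_eio": re.compile(r"nfs_statfs(?:64)?: statfs error = 5"),
}


def classify(line):
    """Return the category name for a log line, or None if nothing matches."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def tally(lines):
    """Count how many lines fall into each category."""
    counts = Counter()
    for line in lines:
        name = classify(line)
        if name is not None:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "mptscsih: ioc0: attempting task abort! (sc=de2e0280)",
        "mptbase: ioc0: IOCStatus(0x0002): Busy",
        "esx04 kernel: nfs_statfs64: statfs error = 5",
    ]
    for name, count in sorted(tally(sample).items()):
        print(name, count)
```

Running this per host and per incident window would at least show which hosts log the NFS errors at the same time as the guests log the aborts.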
We have been investigating counters on the switches and on the filer. Some TCP retransmissions seem to be occurring, but there are no dropped packets, and no packets with bad headers, invalid checksums or similar.
If these problems were the result of high I/O load or latency on the filer, wouldn't the effect be slower transfers rather than VMs simply "losing" their disks for a short period of time?
The ESX hosts are HP DL360 G5, running ESX 3.5u4. The switches are Cisco 2960 (gigabit), with flow control disabled.
Any input on this matter is most welcome!