VMware Cloud Community
mlubinski
Expert

NFS connection to ALL ESX hosts keeps locking

Hello

We have all our VMware clusters connected to shared NFS storage (a NetApp FAS3040). Randomly we get the following messages in /var/log/vmkernel. As you can see, the situation lasts for about 15 seconds. During this time Windows VMs operate normally, but some Linux VMs lock up, because for them this storage is "somehow" not available.

Jun 9 03:31:49 ESXHOST vmkernel: 51:14:09:20.705 cpu1:1585)WARNING: Swap: vm 1584: 7515: Swap sync read failed: status=195887167, retrying...
Jun 9 03:31:49 ESXHOST vmkernel: 51:14:09:20.705 cpu3:1235)WARNING: Swap: vm 1234: 7515: Swap sync read failed: status=195887167, retrying...
Jun 9 05:23:24 ESXHOST vmkernel: 51:16:00:54.994 cpu1:1599)Heap: 1856: tcpip: heap below 9% -- 3002288 bytes free

Jun 10 05:29:32 ESXNAME vmkernel: 52:11:55:28.770 cpu4:1170)WARNING: Swap: vm 1170: 7515: Swap sync read failed: status=195887167, retrying...
Jun 10 05:29:32 ESXNAME vmkernel: 52:11:55:28.770 cpu6:1105)WARNING: Swap: vm 1105: 7515: Swap sync read failed: status=195887167, retrying...
Jun 10 05:29:32 ESXNAME vmkernel: 52:11:55:28.770 cpu5:1809)WARNING: Swap: vm 1809: 7515: Swap sync read failed: status=195887167, retrying...
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc21b330 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc2241a0 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc225318 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc222c20 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc221950 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc229fb0 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc2294f0 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc227760 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc225dd8 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc228b88 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc21df88 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc223028 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc22d820 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc227f70 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc228220 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc228f90 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc221548 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.770 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc227608 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc22cd60 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc2192f0 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc221aa8 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc232cc8 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc2330d0 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc223d98 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc219b00 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc2242f8 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc21e238 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc224048 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc221eb0 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc21e0e0 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc21e390 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc21a5c0 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc21d0c0 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc221140 4
Jun 10 05:29:34 ESXNAME vmkernel: 52:11:55:30.771 cpu6:1030)NFSLock: 516: Stop accessing fd 0xc223990 4
Jun 10 05:29:36 ESXNAME vmkernel: 52:11:55:32.771 cpu6:1162)WARNING: Swap: vm 1162: 7515: Swap sync read failed: status=195887167, retrying...

Jun 10 05:29:45 ESXNAME vmkernel: 52:11:55:41.263 cpu0:1300)VSCSIFs: 441: fd 42 12 status No connection
Jun 10 05:29:48 ESXNAME vmkernel: 52:11:55:44.816 cpu6:1398)WARNING: NFS: 4590: Can't find call with serial number -2146566134
Jun 10 05:29:49 ESXNAME vmkernel: 52:11:55:45.336 cpu1:1100)VSCSIFs: 441: fd 41 10 status No connection
Jun 10 05:29:49 ESXNAME vmkernel: 52:11:55:45.765 cpu5:1809)NFSLock: 478: Start accessing fd 0xc221eb0 again

Jun 10 05:29:49 ESXNAME vmkernel: 52:11:55:45.765 cpu5:1809)NFSLock: 478: Start accessing fd 0xc224048 again
Jun 10 05:29:49 ESXNAME vmkernel: 52:11:55:45.765 cpu5:1809)NFSLock: 478: Start accessing fd 0xc21e238 again
Jun 10 05:29:49 ESXNAME vmkernel: 52:11:55:45.765 cpu5:1809)NFSLock: 478: Start accessing fd 0xc2242f8 again
Jun 10 05:29:49 ESXNAME vmkernel: 52:11:55:45.765 cpu5:1809)NFSLock: 478: Start accessing fd 0xc219b00 again
Jun 10 05:29:49 ESXNAME vmkernel: 52:11:55:45.765 cpu5:1331)NFSLock: 478: Start accessing fd 0xc223d98 again
Jun 10 05:29:49 ESXNAME vmkernel: 52:11:55:45.765 cpu5:1331)NFSLock: 478: Start accessing fd 0xc2330d0 again

A Linux VM with Debian 2.6.18-6-686 #1 sometimes (NOT ALWAYS) gets the message below in /var/log/kern.log during such a problem. Please don't mind the dates; as I said, today this VM didn't have the problem, but yesterday it did, so its MySQL database went into read-only mode and the customer had some downtime because of it 😕

Jun 9 03:23:04 debian kernel: mptbase: ioc0: IOCStatus(0x0002): Busy
Jun 9 03:23:05 debian last message repeated 3 times

Here is the messages file from the NetApp from the time when this VM had the problem:

Tue Jun 9 02:57:58 CEST filername@FILERNAME: TOE-2x10G/e2a:warning: Client IPADDRESS (xid 2974160467) is trying to access an unexported mount (fileid 64 on volume 0x16425e1a )
Tue Jun 9 03:00:00 CEST FILERNAME: kern.uptime.filer:info: 3:00am up 402 days, 3:32 19791039559 NFS ops, 13567224346 CIFS ops, 1737 HTTP ops, 0 FCP ops, 403989472 iSCSI ops
Tue Jun 9 03:08:13 CEST filername@FILERNAME: Auth03:info: Lookup of ldap.tcp.default-first-site-name._sites.AOL.LOCAL failed with DNS server IPADDRESS: Connection timed out.
Tue Jun 9 03:34:30 CEST FILERNAME: ems.engine.inputSuppress:error: Event 'nlm.granted.pm.noResponse' suppressed 8 times since Tue Jun 9 00:32:13 CEST 2009.
Tue Jun 9 03:34:30 CEST filername@FILERNAME: nlm.granted.pm.noResponse:error: Portmapper daemon not responding to vers 2 requests on client IPADDRESS.
Tue Jun 9 03:57:59 CEST filername@FILERNAME: TOE-2x10G/e2a:warning: Client IPADDRESS (xid 1999574867) is trying to access an unexported mount (fileid 64 on volume 0x16425e1a )

And from last night (NetApp logs):

Wed Jun 10 05:31:16 CEST FILERNAME@FILERNAME: nlm.granted.pm.noResponse:error: Portmapper daemon not responding to vers 2 requests on client IPADDRESS.
Wed Jun 10 05:58:04 CEST FILERNAME@FILERNAME: TOE-2x10G/e2a:warning: Client IPADDRESS (xid 1898683475) is trying to access an unexported mount (fileid 64 on volume 0x16425e1a )
Wed Jun 10 06:00:00 CEST FILERNAME: kern.uptime.filer:info: 6:00am up 403 days, 6:32 19999278702 NFS ops, 13650854157 CIFS ops, 1763 HTTP ops, 0 FCP ops, 408122288 iSCSI ops

I hope that someone has had a similar (or the same) problem with NetApp + VMware. This occurs on both our ESX 3.5 clusters and on our vSphere 4 cluster. Thanks in advance for any ideas.
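For what it's worth, the "tcpip: heap below 9%" warning above makes me want to double-check the vmkernel TCP/IP heap and NFS heartbeat settings on the hosts. This is only a sketch from the service console (the -g lines are read-only checks; the -s values at the end are examples, not tuned recommendations, and heap changes only take effect after a host reboot):

# current vmkernel TCP/IP heap settings (MB)
esxcfg-advcfg -g /Net/TcpipHeapSize
esxcfg-advcfg -g /Net/TcpipHeapMax

# NFS heartbeat settings that decide how quickly a datastore is marked unavailable and recovered
esxcfg-advcfg -g /NFS/HeartbeatFrequency
esxcfg-advcfg -g /NFS/HeartbeatTimeout
esxcfg-advcfg -g /NFS/HeartbeatMaxFailures

# example values only -- raise the heap if it turns out to be small (reboot required)
esxcfg-advcfg -s 30 /Net/TcpipHeapSize
esxcfg-advcfg -s 120 /Net/TcpipHeapMax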

If you found this or any other answer useful please consider the use of the Helpful or Correct buttons to award points
4 Replies
AndreTheGiant
Immortal

Have you tried to move the VM swap file from NFS to a local datastore?
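One way to do that, as a rough sketch: set the cluster or host swapfile policy to store swapfiles on the host's local datastore, or do it per VM with the sched.swap.dir option in the .vmx file (the path below is only an example, adjust it to your local datastore and VM name):

sched.swap.dir = "/vmfs/volumes/local-datastore-esxhost1/MyVM"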

Andre

If you found this or any other answer useful please consider allocating points for helpful or correct answers

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
deltajoka
Contributor

Hi,

Have you found the cause of your NFS problems yet?

We have almost identical problems! See my thread at http://communities.vmware.com/thread/237699

Our shared NFS storage is an IBM N3300 A20 (basically a rebranded Netapp FAS2020).

We've only recently gotten the "swap sync" messages like the ones in your thread, but we've had the NFS issue for several months. I think that's because our VMs are not heavily loaded and barely use the swap, so I don't think the cause of the problem is that we have the VM swap files on the NFS filer.

We've experienced the root file system going read-only on some Debian Lenny VMs during these "glitches", much like you describe. The fix for this is to remove the "errors=remount-ro" option in /etc/fstab on the Debian VMs and remount /, as sketched below. Our Debian VMs are always affected; the default SCSI timeout on Debian is 30 seconds. We've seen this just a few times on CentOS VMs as well, but they have a SCSI timeout of 60 seconds. We've never gotten any event log messages on our Windows VMs about this, so they probably have an even higher timeout.
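To make that concrete, here is a minimal sketch of the guest-side workaround, assuming a 2.6-kernel guest with sysfs and /dev/sda as an example device (the 180-second value is just what we picked, not a recommendation):

# current SCSI command timeout in seconds for the first disk
cat /sys/block/sda/device/timeout

# raise it for this boot so the guest can ride out a short NFS outage (repeat per disk: sdb, sdc, ...)
echo 180 > /sys/block/sda/device/timeout

# in /etc/fstab, drop errors=remount-ro from the root entry so an I/O error
# no longer flips / to read-only, e.g.:
#   /dev/sda1  /  ext3  defaults,errors=continue  0  1
# then apply it:
mount -o remount /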

Please read my thread and give input if you've found out some more regarding this issue.

fletch00
Enthusiast

I just wrote this vscsiStats wrapper to dump the total I/Os per VM. Start collection first:

/usr/lib/vmware/bin/vscsiStats -s

$ cat showTotalIOsPerVM.sh

#!/bin/bash
# map worldGroupID -> VM display name so the totals below can be labelled
/usr/lib/vmware/bin/vscsiStats -l | egrep worldGroupID | awk '{print $9 " " $4}' > /tmp/vmnames
# for each VM world group, sum the latency-histogram counts to get a total I/O count
for i in `/usr/lib/vmware/bin/vscsiStats -l | egrep worldGroupID | awk '{print $4}'`
do
echo `/usr/lib/vmware/bin/vscsiStats -p latency -w $i -c | head -17 | tail -11 | awk -F, '{ SUM += $1} END { print SUM }'` " " `egrep $i /tmp/vmnames`
done

Then I wrap it into a once per minute log:

#!/bin/bash
# run forever: dump a sorted per-VM I/O snapshot once per minute
while true
do
./showTotalIOsPerVM.sh | sort -nr | tee ios.`date +%m%d%H%M%S`
sleep 60
done
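To read the snapshots back afterwards: each ios.* file is already sorted descending, so the first line of every file is the busiest VM in that minute (file names here just match the ios.<timestamp> pattern produced above):

for f in ios.*
do
echo "$f: `head -1 $f`"
done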

VCP

VCP5 VSP5 VTSP5 vExpert http://vmadmin.info
Marecki
Contributor

I just saw your answer here. Well, the problem is that when you use NFS, no I/O from the VM to the storage is reported. And when I try vscsiStats with -p latency, I get all zeros.
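So for now I am working from the vmkernel log instead. A rough sketch that pairs the NFSLock "Stop accessing" and "Start accessing" lines to show how long each glitch lasted (path is the classic service console log location):

grep NFSLock /var/log/vmkernel | awk '
/Stop accessing/ && start == "" { start = $1 " " $2 " " $3 }
/Start accessing/ && start != "" { print start " -> " $1 " " $2 " " $3; start = "" }
'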
