VMware Cloud Community
fletch00
Enthusiast

vSphere NFSLock then VMs freeze

Hi, I've had a platinum support case open for this for a while without a resolution.

ESX 4 U1 208167

Netapp 3040 ONTAP 7.3.1.1

Last night between 12:01 and 12:02 am our ESX hosts logged the messages below, and several VMs froze (we had to power cycle them):

Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803a710 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803ae10 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x410008033c50 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803ac50 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x4100080377d0 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x410008037ed0 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803b510 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803aa90 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803b890 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x410008033550 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803bc10 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803bf90 4
Dec 18 00:02:02 irt-esx65 vmkernel: 10:09:51:00.505 cpu2:6778)VSCSI: 2135: handle 8213(vscsi0:1):Reset request on FSS handle 704652 (6 outstanding commands)
Dec 18 00:02:02 irt-esx65 vmkernel: 10:09:51:00.506 cpu1:4161)VSCSI: 2395: handle 8213(vscsi0:1):Reset [Retries: 0/0]
Dec 18 00:02:02 irt-esx65 vmkernel: 10:09:51:00.506 cpu1:4161)VSCSI: 2207: handle 8213(vscsi0:1):Completing reset (0 outstanding commands)
Dec 18 00:02:06 irt-esx65 VMware[init]: [2009-12-17 23:49:29.806 F5C60B90 info 'vm:/vmfs/volumes/f3a64512-358f6a4f/irt-admin-01/minCentos-irt.vmx'] Disconnect check in progress.
Dec 18 00:02:06 irt-esx65 VMware[init]: [2009-12-17 23:49:29.807 F63ABB90 info 'vm:/vmfs/volumes/f3a64512-358f6a4f/irt-admin-01/minCentos-irt.vmx'] Question info: NVRAM: read failed
Dec 18 00:02:06 irt-esx65 VMware[init]: , Id: 1 : Type : 3, Default: 0, Number of options: 1
Dec 18 00:02:06 irt-esx65 VMware[init]: [2009-12-17 23:49:29.807 F6329B90 warning 'vm:/vmfs/volumes/f3a64512-358f6a4f/irt-admin-01/minCentos-irt.vmx'] Received a duplicate transition from foundry: 8, 1


VCP

VCP5 VSP5 VTSP5 vExpert http://vmadmin.info
8 Replies
dnetz
Hot Shot

The fact that this happened at midnight makes me think a scheduled job on the NetApp interrupted the NFS traffic: a snapshot, dedup, backup over NFS, or similar. Do you have any performance numbers for that time? Were the error messages followed by "Start accessing fd 0x" again? Were you able to access the NetApp any other way, or from other hosts, during this time?

I've had far too many of these events while working with "another storage vendor that isn't NetApp", and I learned that most VMs with a fresh VMware Tools install would survive, but some Linux systems remount their filesystems read-only (thus requiring a reboot), and older OSes might simply crash during a storage outage.
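For the Linux case, a quick check from inside the guest after a storage hiccup is to look for mounts whose option list contains `ro`. Below is a minimal sketch, assuming the standard `/proc/mounts` layout (device, mount point, filesystem type, comma-separated options); the function name `read_only_mounts` is just an illustrative choice.

```python
def read_only_mounts(mounts_text):
    """Return (device, mountpoint) pairs whose mount options include 'ro'.

    mounts_text is the content of /proc/mounts (or text in the same layout).
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, _fstype, options = fields[:4]
        # Options are comma-separated; match the exact token "ro",
        # not substrings such as "errors=remount-ro".
        if "ro" in options.split(","):
            result.append((device, mountpoint))
    return result


# Usage inside a Linux guest (not run here):
#   with open("/proc/mounts") as f:
#       for device, mountpoint in read_only_mounts(f.read()):
#           print(device, "on", mountpoint, "is mounted read-only")
```

Some pseudo-filesystems are read-only by design, so only a root or data filesystem that normally mounts read-write and now shows `ro` is the interesting case.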

Hope it helps!

fletch00
Enthusiast

Yes, the frozen VMs were Windows guests with the latest VMware Tools installed; it did not seem to help.

I also have the recommended ESX NFS settings configured via NetApp's VSC (NFS heartbeats, TCP heap, etc.):

http://blogs.netapp.com/storage_nuts_n_bolts/2009/10/netapp-virtual-storage-console-vsc-for-esx-read...

Yes, there are subsequent "Start accessing fd... again" messages. Unfortunately, by then the damage is already done to these frozen VMs.
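To see how long each file handle stayed blocked, the Stop and Start messages can be paired up per host and fd. Here is a rough sketch: the "Stop accessing" pattern follows the log lines posted above, while the exact layout of the "Start accessing" line is an assumption (it is only quoted loosely in this thread). The syslog timestamps carry no year, so `year` is a parameter, and `nfs_lock_gaps` is a made-up helper name.

```python
import re
from datetime import datetime

# Syslog timestamp, host name, then the NFSLock Stop/Start message.
# The "Start" form is ASSUMED to mirror the "Stop" form shown in the thread.
LINE_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vmkernel:.*"
    r"NFSLock:\s*\d+:\s*(?P<event>Stop|Start) accessing fd (?P<fd>0x[0-9a-fA-F]+)"
)


def nfs_lock_gaps(lines, year=2009):
    """Return (host, fd, seconds) for every Stop that is later followed by a Start."""
    stopped = {}  # (host, fd) -> time of the first unmatched Stop
    gaps = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["fd"])
        if m["event"] == "Stop":
            stopped.setdefault(key, ts)
        elif key in stopped:
            gaps.append((key[0], key[1], (ts - stopped.pop(key)).total_seconds()))
    return gaps


# Usage on a saved copy of the log (file name is a placeholder):
#   with open("vmkernel.log") as f:
#       for host, fd, seconds in nfs_lock_gaps(f):
#           print(host, fd, seconds)
```

Syslog timestamps only have one-second resolution, so short gaps will show up as 0 or 1 seconds; the vmkernel uptime field in each line would give finer detail if that matters.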

fletch00
Enthusiast

Netapp recommended turning off dedup until we get this bug fixed:

Deduplication performance degradation in some VMWare use-cases with lot of overwrites in the volume

http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=356153

We've turned off de-dup on all volumes for now

dnetz
Hot Shot

Good to know, I'm setting up new Netapps for ESX this week but luckily it seems to be fixed in OnTap 7.3.2.

fletch00
Enthusiast

We had another big latency spike in the Netapp NFS last night at 8:20pm (with the dedup off)

It took out several important VMs.

We're escalating to Netapp support

dnetz
Hot Shot

Well that sucks..let me know what you find.

fletch00
Enthusiast

Over the holiday break we updated VMware Tools and the virtual hardware version (v4->v7) on all VMs, and the issue has not occurred since.

VMware support now says this was the solution all along!

I asked them to point me to the technical document alerting customers to the importance of updating vmware-tools lest they get VM data loss, and they said it did not exist. I asked what the procedure was for requesting that this document be created, and they said they would not create it unless they could reproduce the problem!

MAKE SURE YOUR VMWARE-TOOLS ARE COMPLETELY UP TO DATE - is what VMware should immediately have told you.

Also check out nfstop to identify which VMs are soaking up your limited supply of IOPS:

http://communities.vmware.com/message/1462324

I ended up migrating a few IO hogs off the NFS datastores to local disk.

cxo
Contributor

I know I am late to this party, but the same thing has happened in my environment as well.

Everything was fine all along, but then every Friday at around 1:15pm we see the same errors and symptoms.

I sent cases to VMware and NetApp. NetApp analyzed the logs and said to update the tools (odd that tools would need updating because of something that happens at 1:15pm weekly), and I am working on that. VMware couldn't even analyze the same VMware logs that NetApp could, so their support was almost useless.

The problem is not resolved here yet, but I'm working on the VMware Tools update (so many systems), and NetApp is still running with the case (collecting all kinds of information).

If something materializes I'll post here.
