Hi, I've had a platinum case open for this for a while without a resolution
ESX 4 U1 208167
Netapp 3040 ONTAP 7.3.1.1
Last night at 12:01-12:02 am our ESX hosts logged the messages below and took several VMs out (we needed to power cycle them):
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803a710 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803ae10 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x410008033c50 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803ac50 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x4100080377d0 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x410008037ed0 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803b510 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803aa90 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803b890 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x410008033550 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803bc10 4
Dec 18 00:01:56 irt-esx65 vmkernel: 10:09:50:54.152 cpu2:4098)NFSLock: 584: Stop accessing fd 0x41000803bf90 4
Dec 18 00:02:02 irt-esx65 vmkernel: 10:09:51:00.505 cpu2:6778)VSCSI: 2135: handle 8213(vscsi0:1):Reset request on FSS handle 704652 (6 outstanding commands)
Dec 18 00:02:02 irt-esx65 vmkernel: 10:09:51:00.506 cpu1:4161)VSCSI: 2395: handle 8213(vscsi0:1):Reset Retries: 0/0
Dec 18 00:02:02 irt-esx65 vmkernel: 10:09:51:00.506 cpu1:4161)VSCSI: 2207: handle 8213(vscsi0:1):Completing reset (0 outstanding commands)
Dec 18 00:02:06 irt-esx65 VMware[init]: 2009-12-17 23:49:29.806 F5C60B90 info 'vm:/vmfs/volumes/f3a64512-358f6a4f/irt-admin-01/minCentos-irt.vmx' Disconnect check in progress.
Dec 18 00:02:06 irt-esx65 VMware[init]: 2009-12-17 23:49:29.807 F63ABB90 info 'vm:/vmfs/volumes/f3a64512-358f6a4f/irt-admin-01/minCentos-irt.vmx' Question info: NVRAM: read failed
Dec 18 00:02:06 irt-esx65 VMware[init]: , Id: 1 : Type : 3, Default: 0, Number of options: 1
Dec 18 00:02:06 irt-esx65 VMware[init]: 2009-12-17 23:49:29.807 F6329B90 warning 'vm:/vmfs/volumes/f3a64512-358f6a4f/irt-admin-01/minCentos-irt.vmx' Received a duplicate transition from foundry: 8, 1
The fact that this happened at midnight makes me think a scheduled job on the NetApp interrupted the NFS traffic: a snapshot, dedup, a backup over NFS, or something similar. Do you have any performance numbers for that time? Were the error messages followed by "Start accessing fd 0x" again? Were you able to access the NetApp any other way, or from other hosts, during this time?
I've had far too many of these events while working with "another storage vendor that isn't NetApp", and I learned that if the VMs had a fresh VMware Tools install, most of them would survive. Some Linux systems remount their filesystems read-only (thus requiring a reboot), though, and older OSes might simply crash during a storage outage.
Hope it helps!
Yes, the frozen VMs were running Windows with the latest VMware Tools installed. It did not seem to help.
I also have the recommended NFS settings for ESX configured via NetApp's VSC (NFS heartbeats, TCP heap, etc.).
Yes, there are subsequent "Start accessing fd... again" messages. Unfortunately, by then the damage is already done to the frozen VMs.
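For anyone who wants to measure how long each file handle was actually cut off, here is a minimal Python sketch that pairs every "Stop accessing" line with the next "Start accessing" line for the same fd. The regex follows the "Stop accessing" lines pasted above; the exact layout of the "Start accessing ... again" line is an assumption (same layout, different verb), and the function name `nfs_lock_outages` is made up for illustration. Syslog lines carry no year, so it is passed in.

```python
import re
from datetime import datetime

# Layout taken from the "Stop accessing" lines quoted in this thread.
# ASSUMPTION: "Start accessing" lines use the same layout with a different verb.
NFSLOCK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vmkernel:.*"
    r"NFSLock:\s*\d+:\s*(?P<action>Stop|Start) accessing fd (?P<fd>0x[0-9a-fA-F]+)"
)

def nfs_lock_outages(lines, year=2009):
    """Pair each Stop with the next Start for the same (host, fd).

    Returns (outages, pending): outages is a list of
    (host, fd, stop_time, seconds_until_start); pending maps (host, fd)
    to the stop time of handles that never logged a Start.
    """
    pending = {}
    outages = []
    for line in lines:
        m = NFSLOCK_RE.match(line)
        if not m:
            continue
        # Syslog timestamps omit the year, so the caller supplies it.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["fd"])
        if m["action"] == "Stop":
            pending.setdefault(key, ts)  # keep the earliest Stop
        elif key in pending:
            stop = pending.pop(key)
            outages.append((m["host"], m["fd"], stop, (ts - stop).total_seconds()))
    return outages, pending
```

Feed it the lines of the vmkernel log from each host. Handles left in `pending` never came back during the capture window, which is worth mentioning to support.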
Netapp recommended turning off dedup until we get this bug fixed:
Deduplication performance degradation in some VMWare use-cases with lot of overwrites in the volume
http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=356153
We've turned off dedup on all volumes for now.
Good to know. I'm setting up new NetApps for ESX this week, but luckily it seems to be fixed in ONTAP 7.3.2.
We had another big latency spike on the NetApp NFS last night at 8:20pm (with dedup off).
It took out several important VMs.
We're escalating to NetApp support.
Well, that sucks. Let me know what you find.
Over the holiday break we updated VMware Tools and the virtual hardware
(v4->v7) on all VMs, and the issue has not occurred since.
VMware support now says this was the solution all along!
I asked them to point me to the technical document alerting customers to
the importance of updating VMware Tools lest they get VM data loss, and
they said it did not exist. I asked what the procedure was for
requesting that this document be created, and they said they would not
create it unless they could reproduce the problem!
MAKE SURE YOUR VMWARE-TOOLS ARE COMPLETELY UP TO DATE - is what VMware
should immediately have told you.
Also check out nfstop to identify which VMs are soaking up your limited
supply of IOPS:
http://communities.vmware.com/message/1462324
I ended up migrating a few IO hogs to local disk off the NFS datastores.
I know I am late to this party, but the same thing has happened in my environment as well.
Everything was fine all along, but then every Friday at 1:15pm-ish we get the same errors and symptoms.
I opened cases with both VMware and NetApp. NetApp analyzed the logs and recommended updating VMware Tools (odd that Tools would need updating because of something that happens at 1:15pm every week), so I am working on that. VMware couldn't even analyze the same VMware logs that NetApp could, so their support was almost useless.
The problem is not resolved here yet, but I'm working on the VMware Tools update (so many systems) and NetApp is still working the case (collecting all kinds of information).
If something materializes I'll post here.
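Since the events in this thread seem to land on a schedule (midnight in one case, Friday around 1:15pm in another), a rough way to check is to bucket the "Stop accessing" lines by weekday and hour and see whether one slot dominates. This is only a sketch: the regex is modelled on the vmkernel lines quoted earlier in the thread, the function name `stop_events_by_slot` is hypothetical, and the year has to be supplied because syslog timestamps do not include it.

```python
import re
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Captures "Mon DD HH" from lines shaped like the vmkernel excerpts above.
STOP_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2}\s+\S+\s+vmkernel:.*"
    r"NFSLock:.*Stop accessing fd"
)

def stop_events_by_slot(lines, year):
    """Count 'Stop accessing' lines per (weekday name, hour of day)."""
    slots = Counter()
    for line in lines:
        m = STOP_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H")
        slots[(WEEKDAYS[ts.weekday()], ts.hour)] += 1
    return slots
```

Calling `stop_events_by_slot(log_lines, 2009).most_common(5)` over a few weeks of logs shows the busiest slots. If one weekday and hour stands out, compare it against the snapshot, dedup and backup schedules on the filer.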