Highlighted
Immortal
Immortal

VMs Crashing on NFS Storage ... ESXi Kernel Log flodded with messages

I've had a number of VMs fail to vMotion or straight up crash due to messages similar to these:

Failed to process the set attributes result: Lost previously held disk lock

2018-12-28T15:19:23.749Z| vcpu-0| I125: >>> Error writing log, 89 bytes discarded. Disk full?

2018-12-28T15:19:23.749Z| vcpu-0| I125: Msg_Question:

2018-12-28T15:19:23.749Z| vcpu-0| I125: [msg.hbacommon.locklost] The lock protecting 'vm.vmdk' has been lost, possibly due to underlying storage issues.

2018-12-28T15:19:23.749Z| vcpu-0| I125+ If this virtual machine is configured to be highly available, ensure that the virtual machine is running on some other host before clicking OK.

2018-12-28T15:19:23.749Z| vcpu-0| I125: ----------------------------------------

2018-12-28T15:23:24.379Z| vcpu-0| I125: Timing out dialog 3174

2018-12-28T15:23:24.380Z| vcpu-0| I125: MsgQuestion: msg.hbacommon.locklost reply=0

2018-12-28T15:23:24.380Z| vcpu-0| E105: PANIC: Exiting because of failed disk operation.

2018-12-28T15:23:25.575Z| vcpu-0| W115: A core file is available in "/vmfs/volumes/77955655-79af1105-0000-000000000000/vm_name/vm.zdump.000"

2018-12-28T15:23:25.575Z| mks| W115: Panic in progress... ungrabbing

2018-12-28T15:23:25.575Z| mks| I125: MKS: Release starting (Panic)

2018-12-28T15:23:25.575Z| mks| I125: MKS: Release finished (Panic)

2018-12-28T15:23:25.576Z| vcpu-0| I125: Writing monitor file `vmmcores.gz`

2018-12-28T15:23:25.578Z| vcpu-0| W115: Dumping core for vcpu-0

2018-12-28T15:23:25.578Z| vcpu-0| I125: CoreDump: dumping core with superuser privileges

2018-12-28T15:23:25.578Z| vcpu-0| I125: VMK Stack for vcpu 0 is at 0x439337d13000

2018-12-28T15:23:25.578Z| vcpu-0| I125: Beginning monitor coredump

2018-12-28T15:23:26.171Z| vcpu-0| I125: End monitor coredump

2018-12-28T15:23:26.172Z| vcpu-0| W115: Dumping core for vcpu-1

2018-12-28T15:23:26.172Z| vcpu-0| I125: CoreDump: dumping core with superuser privileges

2018-12-28T15:23:26.172Z| vcpu-0| I125: VMK Stack for vcpu 1 is at 0x439337f13000

2018-12-28T15:23:26.172Z| vcpu-0| I125: Beginning monitor coredump

2018-12-28T15:23:26.577Z| mks| W115: Panic in progress... ungrabbing

2018-12-28T15:23:26.577Z| mks| I125: MKS: Release starting (Panic)

2018-12-28T15:23:26.577Z| mks| I125: MKS: Release finished (Panic)

2018-12-28T15:23:26.759Z| vcpu-0| I125: End monitor coredump

2018-12-28T15:23:27.220Z| vcpu-0| I125: Printing loaded objects

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9922257000-0x9923324914): /bin/vmx

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x996391E000-0x99639351CC): /lib64/libpthread.so.0

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9963B3B000-0x9963B3CF00): /lib64/libdl.so.2

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9963D3F000-0x9963D47D08): /lib64/librt.so.1

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9963F5A000-0x99641F24E4): /lib64/libcrypto.so.1.0.2

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9964423000-0x996448C4AC): /lib64/libssl.so.1.0.2

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9964697000-0x99647AB37C): /lib64/libX11.so.6

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x99649B2000-0x99649C101C): /lib64/libXext.so.6

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9964BC2000-0x9964CA6341): /lib64/libstdc++.so.6

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9964EC5000-0x9964F4521C): /lib64/libm.so.6

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9965148000-0x996515CBC4): /lib64/libgcc_s.so.1

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x996535E000-0x99654BEC74): /lib64/libc.so.6

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x99236FD000-0x992371A7D8): /lib64/ld-linux-x86-64.so.2

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x99656C9000-0x99656E3634): /lib64/libxcb.so.1

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x99658E5000-0x99658E695C): /lib64/libXau.so.6

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9965EB5000-0x9965F4A534): /usr/lib64/vmware/plugin/objLib/upitObjBE.so

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9966163000-0x99662B9994): /usr/lib64/vmware/plugin/objLib/vsanObjBE.so

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9966550000-0x9966564F94): /lib64/libz.so.1

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x99669AF000-0x99669BA1D0): /lib64/libnss_files.so.2

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9966CC5000-0x9966CC60A4): /usr/lib/vmware/nas_plugins/lib64/libvmfsNasPlugin.so

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9966EC8000-0x9966EDB44C): /lib64/libvmkfslib.so

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x99670DD000-0x996733A6D4): /lib64/libvmlibs.so

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x99675E1000-0x99678F91A4): /lib64/libvmsnapshot.so

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9967B13000-0x9967B20F14): /lib64/libtreestructs.so

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9967D23000-0x9967D245BC): /usr/lib/vmware/nas_plugins/lib64/libsvaNasPlugin.so

2018-12-28T15:23:27.220Z| vcpu-0| I125: [0x9967F26000-0x9967F4236C): /usr/lib/vmware/nas_plugins/lib64/libnvdkTegileNasPlugin.so

2018-12-28T15:23:27.220Z| vcpu-0| I125: End printing loaded objects

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace:

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[0] 0000009969aa6460 rip=00000099228d75e7 rbx=00000099228d70e0 rbp=0000009969aa6480 r12=0000000000000000 r13=0000000000000001 r14=0000009923ba1460 r15=0000009968371020

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[1] 0000009969aa6490 rip=000000992241766c rbx=0000009969aa64b0 rbp=0000009969aa6990 r12=000000992357c2b0 r13=0000000000000001 r14=0000009923ba1460 r15=0000009968371020

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[2] 0000009969aa69a0 rip=0000009922582d41 rbx=0000009923d68a40 rbp=0000009969aa69c0 r12=0000000000000002 r13=0000009923c9eb48 r14=0000009923ba1460 r15=0000009968371020

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[3] 0000009969aa69d0 rip=000000992257588d rbx=0000009923c9b030 rbp=0000009969aa6b70 r12=00000099233e3550 r13=0000000000000165 r14=0000009923ba1460 r15=0000009968371020

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[4] 0000009969aa6b80 rip=0000009922825431 rbx=00000099236904a8 rbp=0000009969aa6bb0 r12=00000099233e3550 r13=0000000000000165 r14=0000009923ba1460 r15=0000000000000000

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[5] 0000009969aa6bc0 rip=000000992284d176 rbx=000000000000012d rbp=0000009969aa6c00 r12=000000992357c2b0 r13=0000009923687d80 r14=00000099235621a0 r15=0000000000000000

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[6] 0000009969aa6c10 rip=0000009922825518 rbx=0000000000000000 rbp=0000009969aa6c20 r12=0000009969aa6c40 r13=0000009923dbb290 r14=0000009923df18d0 r15=0000000000000003

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[7] 0000009969aa6c30 rip=00000099228c2b05 rbx=0000009923691aa0 rbp=0000009969aa6d80 r12=0000009969aa6c40 r13=0000009923dbb290 r14=0000009923df18d0 r15=0000000000000003

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[8] 0000009969aa6d90 rip=0000009963925cfc rbx=0000000000000000 rbp=0000000000000000 r12=000003805f591a00 r13=0000009969aa79c0 r14=0000009923df18d0 r15=0000000000000003

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[9] 0000009969aa6ea0 rip=000000996542fead rbx=0000000000000000 rbp=0000000000000000 r12=000003805f591a00 r13=0000009969aa79c0 r14=0000009923df18d0 r15=0000000000000003

2018-12-28T15:23:27.220Z| vcpu-0| I125: Backtrace[10] 0000009969aa6ea8 rip=0000000000000000 rbx=0000000000000000 rbp=0000000000000000 r12=000003805f591a00 r13=0000009969aa79c0 r14=0000009923df18d0 r15=0000000000000003

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[0] 0000009969aa6460 rip=00000099228d75e7 in function (null) in object /bin/vmx loaded at 0000009922257000

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[1] 0000009969aa6490 rip=000000992241766c in function (null) in object /bin/vmx loaded at 0000009922257000

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[2] 0000009969aa69a0 rip=0000009922582d41 in function (null) in object /bin/vmx loaded at 0000009922257000

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[3] 0000009969aa69d0 rip=000000992257588d in function (null) in object /bin/vmx loaded at 0000009922257000

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[4] 0000009969aa6b80 rip=0000009922825431 in function (null) in object /bin/vmx loaded at 0000009922257000

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[5] 0000009969aa6bc0 rip=000000992284d176 in function (null) in object /bin/vmx loaded at 0000009922257000

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[6] 0000009969aa6c10 rip=0000009922825518 in function (null) in object /bin/vmx loaded at 0000009922257000

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[7] 0000009969aa6c30 rip=00000099228c2b05 in function (null) in object /bin/vmx loaded at 0000009922257000

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[8] 0000009969aa6d90 rip=0000009963925cfc in function (null) in object /lib64/libpthread.so.0 loaded at 000000996391e000

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[9] 0000009969aa6ea0 rip=000000996542fead in function clone in object /lib64/libc.so.6 loaded at 000000996535e000

2018-12-28T15:23:27.220Z| vcpu-0| I125: SymBacktrace[10] 0000009969aa6ea8 rip=0000000000000000

2018-12-28T15:23:27.220Z| vcpu-0| I125: Exiting

Here is a sample of what's going on in vmkernel.log

2018-12-28T15:19:23.751Z warning hostd[9A81B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/77955655-79af1105-0000-000000000000/vm/vm.vmx] Failed to find activation record, event user unknown.

2018-12-28T15:19:23.752Z verbose hostd[9A81B70] [Originator@6876 sub=PropertyProvider] RecordOp REMOVE: latestPage[33726], session[52610cea-e1e8-e07e-26d6-259c4517645b]52e66a9c-32b6-479e-319f-ca34434ae415. Applied change to temp map.

2018-12-28T15:19:23.752Z verbose hostd[9A81B70] [Originator@6876 sub=PropertyProvider] RecordOp ADD: latestPage[33736], session[52610cea-e1e8-e07e-26d6-259c4517645b]52e66a9c-32b6-479e-319f-ca34434ae415. Applied change to temp map.

2018-12-28T15:19:23.752Z verbose hostd[9A81B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: latestEvent, ha-eventmgr. Applied change to temp map.

2018-12-28T15:19:23.752Z info hostd[9A81B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 33736 : Message on vm on esxi in ha-datacenter: The lock protecting 'vm.vmdk' has been lost, possibly due to underlying storage issues.

--> If this virtual machine is configured to be highly available, ensure that the virtual machine is running on some other host before clicking OK.

-->

2018-12-28T16:00:05.215Z cpu57:1437367)WARNING: NFS41: NFS41FileOpSetFileAttributes:4540: Failed to process the set attributes result: Lost previously held disk lock

2018-12-28T16:00:05.616Z cpu48:1437367)WARNING: NFS41: NFS41FileOpSetFileAttributes:4540: Failed to process the set attributes result: Lost previously held disk lock

2018-12-28T16:00:06.019Z cpu48:1437367)WARNING: NFS41: NFS41FileOpSetFileAttributes:4540: Failed to process the set attributes result: Lost previously held disk lock

2018-12-28T16:00:06.414Z cpu48:1437367)WARNING: NFS41: NFS41FileOpSetFileAttributes:4540: Failed to process the set attributes result: Lost previously held disk lock

2018-12-28T16:00:06.814Z cpu48:1437367)WARNING: NFS41: NFS41FileOpSetFileAttributes:4540: Failed to process the set attributes result: Lost previously held disk lock

2018-12-28T16:00:10.623Z cpu7:68509)WARNING: NFS41: NFS41_NFSAttrToVolatileFSAttrs:1641: Server FILES_FREE 5779000868 exceeds ESX limit

2018-12-28T16:00:10.623Z cpu7:68509)WARNING: NFS41: NFS41_NFSAttrToVolatileFSAttrs:1654: Server FILES_TOTAL 5779005873 exceeds ESX limit

2018-12-28T16:00:10.629Z cpu7:68509)WARNING: NFS41: NFS41_NFSAttrToVolatileFSAttrs:1641: Server FILES_FREE 5779000868 exceeds ESX limit

2018-12-28T16:00:10.629Z cpu7:68509)WARNING: NFS41: NFS41_NFSAttrToVolatileFSAttrs:1654: Server FILES_TOTAL 5779005873 exceeds ESX limit

2018-12-28T16:00:10.633Z cpu7:68509)WARNING: NFS41: NFS41_NFSAttrToVolatileFSAttrs:1641: Server FILES_FREE 5779000868 exceeds ESX limit

2018-12-28T16:00:20.596Z cpu20:67009)SunRPC: 1116: Destroying world 0x1a0157

2018-12-28T16:00:20.598Z cpu20:67009)SunRPC: 1116: Destroying world 0x1a018e

2018-12-28T16:00:20.628Z cpu44:66542)WARNING: NFS: 221: Got error 13 from mount call

2018-12-28T16:00:20.628Z cpu20:67009)SunRPC: 1116: Destroying world 0x1a018f

2018-12-28T16:00:20.630Z cpu20:67009)SunRPC: 1116: Destroying world 0x1a0190

2018-12-28T16:00:20.649Z cpu44:66542)WARNING: NFS: 221: Got error 13 from mount call

2018-12-28T16:00:20.649Z cpu20:67009)SunRPC: 1116: Destroying world 0x1a0191

vobd shows the VM crash happening

It shows an APD on the NFS storage well before the crash itself, maybe about an hour before:

2018-12-28T11:17:56.077Z: [APDCorrelator] 2034244476342us: [vob.storage.apd.start] Device or filesystem with identifier [77955655-79af1105-0000-000000000000] has entered the All Paths Down state.

0 Kudos
5 Replies
Highlighted
Immortal
Immortal

These issues can be tricky, but you haven't mentioned what version of vSphere/ESXi, build, or type of storage you're using. It may be best for you to gather logs and open an SR with VMware to determine where to look. Since it appears you may be using NFS 4.1, you might also need to engage your storage vendor.

0 Kudos
Highlighted
Immortal
Immortal

Yeah, I should have included those bits.

VMware ESXi, 6.5.0, 10884925

PowerEdge R820

The storage is a Tegile NAS device.  I don't know the exact model off the top of my head.

As for support, I have to go through someone else to get that ... Just found it odd that some of those messages aren't really online anywhere at all.

0 Kudos
Highlighted
Immortal
Immortal

First thing you may want to do need to do is check the Tegile best practices guide for this array with vSphere over NFS 4.1 and compare those recommendations with the ESXi hosts to ensure everything is being met. If you've done so and everything is copacetic, then you may want to speak with Tegile directly. I have not used any of their arrays in a number of years so I don't have any direct, modern advice to offer.

0 Kudos
Highlighted
Contributor
Contributor

IamTHEvilONE

Sorry to resurrect an old thread, but I am curious if you ever discovered the cause of this issue or found a resolution?  I am seeing the 'exact' same issue with NFS 4.1 datastores mounting from a Tegile.

0 Kudos
Highlighted
Immortal
Immortal

Hey DanP​ ... sorry I haven't logged in for a while.  It ended up that the storage device didn't support NFS 4.1 under ESXi ... or at least that we could tell.  Using NFS v3 seems to be the solution and native plugin.  We also significantly reduced load over time from the NAS device which initially made us think it was just overloaded ... but alas ... changing protocol was the biggest help.

0 Kudos