I have ESXi 5.5 Update 1 running on an Intel NUC, and I've seen it get stupid twice now in six weeks. The symptom is that the VMs become inaccessible, although they still show up as green/running in the Windows client. At the same time, the other VMs known to the host are greyed out as unavailable in the GUI, and when I log into the ESXi host via SSH, df looks very strange: the /vmfs/volumes/vm entry shows zero bytes in every column:
~ # df
Filesystem Bytes Used Available Use% Mounted on
VMFS-5 0 0 0 0% /vmfs/volumes/vm
vfat 261853184 165478400 96374784 63% /vmfs/volumes/6bd3fde8-6e085f5c-08fa-706744cb5db9
vfat 261853184 165498880 96354304 63% /vmfs/volumes/104e0cef-148f4ff6-92f4-23c5628c7b64
vfat 299712512 202006528 97705984 67% /vmfs/volumes/53f1e12f-31e9cdc4-de70-c03fd566c7a4
If I use the Windows client to shut down the VMs and put the host into maintenance mode, I can reboot it OK. When it comes back up all is fine, although I then need to take the host out of maintenance mode and restart the VMs. The df output looks normal again:
~ # df
Filesystem Bytes Used Available Use% Mounted on
VMFS-5 536602476544 385139867648 151462608896 72% /vmfs/volumes/vm
vfat 261853184 165478400 96374784 63% /vmfs/volumes/6bd3fde8-6e085f5c-08fa-706744cb5db9
vfat 261853184 165498880 96354304 63% /vmfs/volumes/104e0cef-148f4ff6-92f4-23c5628c7b64
vfat 299712512 202006528 97705984 67% /vmfs/volumes/53f1e12f-31e9cdc4-de70-c03fd566c7a4
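The difference between the two df listings can be checked mechanically. Below is a minimal Python sketch, assuming the df text has been saved or captured over SSH and is parsed on a workstation; the function name find_empty_vmfs is made up for illustration and is not part of any VMware tool.

```python
def find_empty_vmfs(df_output):
    """Return mount points of VMFS volumes that df reports with a size of 0 bytes."""
    suspect = []
    for line in df_output.splitlines():
        fields = line.split()
        # Expected columns: filesystem, bytes, used, available, use%, mount point
        if len(fields) < 6 or not fields[0].startswith("VMFS"):
            continue  # skips the header line and non-VMFS (vfat) entries
        try:
            size = int(fields[1])
        except ValueError:
            continue  # not a numeric size, ignore the line
        if size == 0:
            # Re-join in case the mount point itself contains spaces
            suspect.append(" ".join(fields[5:]))
    return suspect


bad = """Filesystem Bytes Used Available Use% Mounted on
VMFS-5 0 0 0 0% /vmfs/volumes/vm
vfat 261853184 165478400 96374784 63% /vmfs/volumes/6bd3fde8-6e085f5c-08fa-706744cb5db9"""

print(find_empty_vmfs(bad))  # ['/vmfs/volumes/vm']
```

A check like this only reports the symptom (the datastore has dropped out); it says nothing about why.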
Any ideas on (a) what is going on, (b) which logs to check and what to look for in them, and (c) how to prevent it?
I have to admit to being lost in all the vWhatever product buzzwords, strange CLIs and logfiles, so be nice.
The host is losing access to the VMFS datastore, and possibly to the underlying storage device.
Please attach the vmkernel logs from the host covering the time when the issue occurs.
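On ESXi 5.x the vmkernel log is normally found at /var/log/vmkernel.log. Most of it is routine chatter, so it can help to pull out just the storage-related lines before posting. The following is a rough Python sketch, meant to run on a copy of the log on a workstation; the pattern list and the name storage_lines are assumptions chosen for illustration, not an official filter.

```python
import re

# Assumed markers of storage trouble: libata port messages (ata1:, ata1.00:),
# the native multipathing plugin (NMP:) and the SCSI device layer (ScsiDeviceIO:).
STORAGE_PATTERN = re.compile(r"\bata\d+(?:\.\d+)?:|\bNMP:|\bScsiDeviceIO:")


def storage_lines(lines):
    """Return only the log lines that match one of the storage markers."""
    return [ln.rstrip("\n") for ln in lines if STORAGE_PATTERN.search(ln)]


sample = [
    "2014-10-02T05:56:20.004Z cpu0:44713)World: 14296: VC opID hostd-ed47 maps to vmkernel opID edabca69\n",
    "2014-10-02T05:56:56.242Z cpu1:33386)<3>ata1.00: irq_stat 0x09000000, interface fatal error\n",
    "2014-10-02T05:57:06.265Z cpu0:33386)<6>ata1: hard resetting link\n",
]

for line in storage_lines(sample):
    print(line)
```

To run it against a real file, pass an open file object instead of the sample list, for example storage_lines(open("vmkernel.log")).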
Hello vds,
I had a chance to review the vmkernel logs you uploaded and noticed the messages below:
vmkernel.log snippet
-----------------------------
VC opID hostd-d62d maps to vmkernel opID d900a9ad
2014-10-02T05:54:40.003Z cpu0:33932)World: 14296: VC opID hostd-e388 maps to vmkernel opID f4bc832b
2014-10-02T05:55:00.004Z cpu2:34461)World: 14296: VC opID hostd-d62d maps to vmkernel opID d900a9ad
2014-10-02T05:55:20.002Z cpu0:44713)World: 14296: VC opID hostd-501b maps to vmkernel opID 94a005aa
2014-10-02T05:56:00.003Z cpu0:33939)World: 14296: VC opID hostd-d62d maps to vmkernel opID d900a9ad
2014-10-02T05:56:20.004Z cpu0:44713)World: 14296: VC opID hostd-ed47 maps to vmkernel opID edabca69
2014-10-02T05:56:56.242Z cpu1:33386)<3>ata1.00: exception Emask 0x10 SAct 0x2 SErr 0x280100 action 0x6 frozen
2014-10-02T05:56:56.242Z cpu1:33386)<3>ata1.00: irq_stat 0x09000000, interface fatal error
2014-10-02T05:56:56.242Z cpu1:33386)<3>ata1: SError: { UnrecovData 10B8B BadCRC }
2014-10-02T05:56:56.242Z cpu1:33386)<3>ata1.00: cmd 60/20:08:dd:ec:fb/00:00:38:00:00/40 tag 1 ncq 16384 in
res 40/00:0c:dd:ec:fb/00:00:38:00:00/40 Emask 0x10 (ATA bus error)
2014-10-02T05:56:56.242Z cpu1:33386)<3>ata1.00: status: { DRDY }
2014-10-02T05:56:56.242Z cpu1:33386)<6>ata1: hard resetting link
2014-10-02T05:57:00.003Z cpu0:33939)World: 14296: VC opID hostd-2335 maps to vmkernel opID 70fae8a4
2014-10-02T05:57:01.769Z cpu3:33386)<4>ata1: port is slow to respond, please be patient (Status 0x80)
2014-10-02T05:57:06.265Z cpu0:33386)<3>ata1: COMRESET failed (errno=-16)
2014-10-02T05:57:06.265Z cpu0:33386)<6>ata1: hard resetting link
2014-10-02T05:57:06.364Z cpu0:32785)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x412e80873b40, 32777) to dev "t10.ATA_____WDC_WD10JFCX2D68N6GN0_________________________WD2DWX71A44E6438" on path "vmhba0:C0:T0:L0" Failed: H:0x5 D:0x0 P:0x0 Possible sen$
2014-10-02T05:57:06.365Z cpu0:32785)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "t10.ATA_____WDC_WD10JFCX2D68N6GN0_________________________WD2DWX71A44E6438" state in doubt; requested fast path state update...
2014-10-02T05:57:06.365Z cpu0:32785)ScsiDeviceIO: 2337: Cmd(0x412e80873b40) 0x2a, CmdSN 0xa0015 from world 32777 to dev "t10.ATA_____WDC_WD10JFCX2D68N6GN0_________________________WD2DWX71A44E6438" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x$
2014-10-02T05:57:06.768Z cpu2:33386)<6>ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
2014-10-02T05:57:06.770Z cpu2:33386)<6>ata1.00: configured for UDMA/133
2014-10-02T05:57:06.770Z cpu2:33386)<6>ata1: EH complete
2014-10-02T05:57:06.770Z cpu1:33274)ScsiDeviceIO: 2324: Cmd(0x412e80840f80) 0x28, CmdSN 0xa0013 from world 35470 to dev "t10.ATA_____WDC_WD10JFCX2D68N6GN0_________________________WD2DWX71A44E6438" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x0 0x0.
2014-10-02T05:57:20.002Z cpu1:34461)World: 14296: VC opID hostd-d62d maps to vmkernel opID d900a9ad
2014-10-02T05:57:40.002Z cpu2:44713)World: 14296: VC opID hostd-06eb maps to vmkernel opID 646d9824
2014-10-02T05:57:51.404Z cpu2:33939)World: 14296: VC opID hostd-d62d maps to vmkernel opID d900a9ad
2014-10-02T05:58:00.003Z cpu0:33932)World: 14296: VC opID hostd-2335 maps to vmkernel opID 70fae8a4
2014-10-02T05:58:55.481Z cpu1:33932)World: 14296: VC opID hostd-2335 maps to vmkernel opID 70fae8a4
2014-10-02T05:59:00.002Z cpu2:33939)World: 14296: VC opID hostd-ed47 maps to vmkernel opID edabca69
2014-10-02T05:59:40.004Z cpu2:44713)World: 14296: VC opID hostd-b854 maps to vmkernel opID 44dad3cd
2014-10-02T05:59:51.410Z cpu0:33939)World: 14296: VC opID hostd-2335 maps to vmkernel opID 70fae8a4
2014-10-02T06:00:00.002Z cpu3:33932)
The log shows ATA bus errors, interface fatal errors with BadCRC, and COMRESET failures on ata1, followed by failed I/O to the WDC disk on vmhba0. This isolates the problem to the storage hardware, most likely the SATA controller used in the server. I would also recommend contacting your hardware vendor and asking them to run a hardware diagnostic check on the server.
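The "H:... D:... P:..." triplets in the NMP and ScsiDeviceIO lines are the host, device and plugin status of the failed SCSI command. As a hedged illustration, the Python sketch below decodes them using a partial table of commonly documented values; the tables are deliberately incomplete, and the function name decode_scsi_status is invented for this example.

```python
import re

# Partial lookup tables (assumption: only the values most often seen are listed).
HOST_STATUS = {0x0: "OK", 0x1: "NO_CONNECT", 0x2: "BUS_BUSY", 0x3: "TIMEOUT",
               0x5: "ABORT", 0x7: "ERROR", 0x8: "RESET"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION", 0x8: "BUSY",
                 0x18: "RESERVATION CONFLICT"}
SENSE_KEY = {0x0: "NO SENSE", 0x2: "NOT READY", 0x3: "MEDIUM ERROR",
             0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
             0x6: "UNIT ATTENTION", 0xB: "ABORTED COMMAND"}

HEX = r"(0x[0-9a-fA-F]+)"
STATUS_RE = re.compile(
    r"H:" + HEX + r" D:" + HEX + r" P:" + HEX + r"(?: Valid sense data: " + HEX + r")?"
)


def decode_scsi_status(line):
    """Decode the H:/D: status (and the sense key, if marked valid) from a log line."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    host, device = int(m.group(1), 16), int(m.group(2), 16)
    result = {
        "host": HOST_STATUS.get(host, hex(host)),
        "device": DEVICE_STATUS.get(device, hex(device)),
        "sense_key": None,  # stays None when the log only says "Possible sense data"
    }
    if m.group(4) is not None:
        key = int(m.group(4), 16)
        result["sense_key"] = SENSE_KEY.get(key, hex(key))
    return result


print(decode_scsi_status("failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0"))
print(decode_scsi_status("failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x0 0x0."))
```

Read this way, the write (0x2a) at 05:57:06.365 was aborted at the host adapter level while the link was being reset, and the read (0x28) at 05:57:06.770 came back from the device with CHECK CONDITION and sense key ABORTED COMMAND, which is consistent with the SATA link errors logged just before.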
