I have a small home computer that I converted into an ESXi server, and it has been running this way for about 4 years.
It is an ASRock DeskMini with an i3 processor, 32 GB of RAM and an M.2 SSD.
Because I previously had sudden power losses in my house and then risked having to reinstall VMware, I added a SanDisk USB drive of about 64 GB. VMware is installed on this USB drive, and the SSD stores all the data of the virtual machines.
Since I moved to another house, however, the ESXi server has been behaving strangely: at some point it stops responding, and it no longer reacts to a physical keyboard either. The screen still shows that I can press F2, but pressing it does nothing.
I then have to hold the power button to turn the system off and restart it. Until yesterday I was using an older version of ESXi (build 16324942), so I decided to try a newer version (the upgrade gave me an error, so I chose a fresh install of the new version on the same USB drive).
But within an hour (without connecting any VM again) it crashed again, and I could only get it to respond by holding the power button. Afterwards I was able to connect the servers again, but this morning it stopped responding once more.
I turned on the device around 07:17 and the latest logs are from 02:43 this morning.
Does anyone have an idea what I could do to fix this?
In the attachment you can find all logs combined from 02:00 until 07:17.
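To pin down when the host stopped logging, the longest silence in the combined log can be located automatically. The following is a minimal Python sketch, assuming one entry per line and that each entry starts with a UTC timestamp in the same layout as vmkernel.log; the name `largest_gap` is made up for illustration.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. 2021-11-26T20:26:22.652Z
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def largest_gap(lines):
    """Return (gap_seconds, last_before, first_after) for the longest
    silence between two consecutive timestamped lines, or None."""
    stamps = []
    for line in lines:
        token = line.split(" ", 1)[0].strip()
        try:
            stamps.append(datetime.strptime(token, TS_FORMAT))
        except ValueError:
            continue  # skip lines that do not start with a timestamp

    best = None
    for prev, cur in zip(stamps, stamps[1:]):
        gap = (cur - prev).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, prev, cur)
    return best


# Possible usage on a log copied off the host (file name is an assumption):
# with open("vmkernel.log") as fh:
#     print(largest_gap(fh))
```

The last timestamp before the gap marks roughly when the freeze started, so the entries just before it are the ones worth reading first.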
I had a similar issue a year ago in our production environment. The VMs were responsive, but I was not able to migrate them or back them up.
If you are able to test on that host, run the following command via SSH: /etc/init.d/sfcbd-watchdog stop
On the ESXi host, stop the CIM provider service. I also updated the iLO and BIOS to the newest versions, along with all drivers and firmware.
I know you want a concrete solution, but all these steps together helped me solve that issue.
Thanks for your help so far.
In my case, however, the VMs are not responding either; that is how I noticed that the ESXi server itself was having issues.
This morning the same issue occurred again: the system was still running, but the containers were unavailable and the ESXi host no longer responded, not even to direct access via the keyboard and display attached to the device. The display still shows the grayed-out screen, but keyboard input no longer works.
I rebooted the device and executed the command you mentioned:
/etc/init.d/sfcbd-watchdog stop
Besides this, I found the article "Issuing a 0x85 SCSI command from a VMware ESXi 6.0 host with the EMC XtremIO storage array may resul...", because I saw this error in the log as well: "cpu3:524635)NMP: nmp_ResetDeviceLogThrottling:3782: last error status from device mpx.vmhba32:C0:T0:L0 repeated 1 times". So I also executed:
/etc/init.d/smartd stop
chkconfig smartd off
Because I'm running ESXi from a USB drive, I doubt whether that drive is still fully working. Perhaps it has a fault that is causing ESXi to crash. When the system freezes again, I will try installing ESXi on another USB drive to check whether that fixes the problem, because I see these errors in "vmkernel.log":
2021-11-26T20:26:22.652Z cpu1:524312)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x28 (0x453a411d66c0, 526185) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed:
2021-11-26T20:26:22.652Z cpu1:524312)NMP: nmp_ThrottleLogForDevice:3865: H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. Act:NONE. cmdId.initiator=0x43034bcaccc0 CmdSN 0xc
2021-11-26T20:26:22.652Z cpu1:524312)ScsiDeviceIO: 4062: Cmd(0x453a411d66c0) 0x28, CmdSN 0xc from world 526185 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.
2021-11-26T20:26:22.652Z cpu2:526651)Vol3: 1982: Couldn't read volume header from 60749f38-e295605f-dd74-78f29e90b189: I/O error
2021-11-26T20:26:22.653Z cpu2:526651)Vol3: 4226: Failed to get object 28 type 1 uuid 60749f39-70241bc6-d9cc-78f29e90b189 FD 0 gen 0 :I/O error
2021-11-26T20:26:22.653Z cpu2:526651)WARNING: Fil3: 1518: Failed to reserve volume f533 28 1 60749f39 70241bc6 f278d9cc 89b1909e 0 0 0 0 0 0 0
2021-11-26T20:26:22.653Z cpu2:526651)Vol3: 4226: Failed to get object 28 type 2 uuid 60749f39-70241bc6-d9cc-78f29e90b189 FD 4 gen 1 :I/O error
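The H:/D:/P: fields and the three sense bytes in these lines are the host status, device status, plugin status, and the SCSI sense key, ASC and ASCQ. A rough Python sketch for decoding them follows; it assumes the line layout shown above, the function and table names are made up for illustration, and the lookup tables cover only a handful of standard SCSI codes rather than the full set.

```python
import re

# Deliberately small tables of standard SCSI codes; extend as needed.
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SENSE_KEY = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}

HEX = r"(0x[0-9a-fA-F]+)"
# Assumed layout: "H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0"
PATTERN = re.compile(
    rf"H:{HEX} D:{HEX} P:{HEX} Valid sense data: {HEX} {HEX} {HEX}"
)


def decode_scsi_status(line):
    """Return a dict with decoded status fields, or None if the line
    does not contain a status/sense block in the assumed layout."""
    match = PATTERN.search(line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in match.groups())
    return {
        "host_status": host,
        "device_status": DEVICE_STATUS.get(device, hex(device)),
        "plugin_status": plugin,
        "sense_key": SENSE_KEY.get(key, hex(key)),
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:#04x}/{ascq:#04x}"),
    }
```

For the lines above this gives CHECK CONDITION with sense key MEDIUM ERROR and additional sense UNRECOVERED READ ERROR on command 0x28, which is READ(10). That would be consistent with a USB drive that can no longer read some of its blocks.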
I used the command below to check which drives have which names:
esxcli storage core path list
As far as I can see, the errors are related to the USB drive:
usb.vmhba32-usb.0:0-mpx.vmhba32:C0:T0:L0
UID: usb.vmhba32-usb.0:0-mpx.vmhba32:C0:T0:L0
Runtime Name: vmhba32:C0:T0:L0
Device: mpx.vmhba32:C0:T0:L0
Device Display Name: Local USB Direct-Access (mpx.vmhba32:C0:T0:L0)
Adapter: vmhba32
Channel: 0
Target: 0
LUN: 0
Plugin: NMP
State: active
Transport: usb
Adapter Identifier: usb.vmhba32
Target Identifier: usb.0:0
Adapter Transport Details: Unavailable or path is unclaimed
Target Transport Details: Unavailable or path is unclaimed
Maximum IO Size: 32768
pcie.100-pcie.0:0-eui.0000000001000000e4d25c5331cb5201
UID: pcie.100-pcie.0:0-eui.0000000001000000e4d25c5331cb5201
Runtime Name: vmhba1:C0:T0:L0
Device: eui.0000000001000000e4d25c5331cb5201
Device Display Name: Local NVMe Disk (eui.0000000001000000e4d25c5331cb5201)
Adapter: vmhba1
Channel: 0
Target: 0
LUN: 0
Plugin: HPP
State: active
Transport: pcie
Adapter Identifier: pcie.100
Target Identifier: pcie.0:0
Adapter Transport Details: Unavailable or path is unclaimed
Target Transport Details: Unavailable or path is unclaimed
Maximum IO Size: 131072
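Matching the device name from the error lines against this output can also be scripted. Below is a small Python sketch, assuming the output has been saved as text in the layout shown above (a heading line per path followed by "Key: value" lines); `parse_path_list` and `transport_of` are hypothetical helper names, not part of any VMware tool.

```python
def parse_path_list(text):
    """Split saved `esxcli storage core path list` output into one
    dict per path. Lines without ': ' are treated as path headings."""
    paths = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(": ")
        if not sep:
            paths.append({"Path": line})  # start of a new path entry
        elif paths:
            paths[-1][key] = value
    return paths


def transport_of(paths, device):
    """Return the Transport value for the given device name, or None."""
    for path in paths:
        if path.get("Device") == device:
            return path.get("Transport")
    return None
```

With the output above, `transport_of(paths, "mpx.vmhba32:C0:T0:L0")` returns `usb`, while the `eui.` device on vmhba1 maps to `pcie`.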
I will get back when I know more.
Are you able to patch that ESXi host?
Disabling the services didn't help. Later yesterday it froze again. I had unregistered all VMs, to be sure they were not the cause.
I installed the latest downloadable version of VMware ESXi 7, so I'm not sure which patch you are referring to.
I will check whether installing on a different USB drive helps.
How many ESXi hosts do you have in that cluster?
Does only one specific host have these issues? Please check and compare their settings.
Have you configured the option to send dumps via the dump collector? Maybe VMware support will be able to assist.
It is just one system. I'm currently running a couple of stress tests on it to see whether the problem is hardware related.
Will get back to you.
Update 19:30
Memtest ran for about 2-3 hours and completed 6 iterations, all without any error.
The strange part was that it responded when I pressed ESC, but the device didn't shut down or exit the app, so I had to hard power off the device again.
I'm not sure whether that was an issue with memtest or whether something else is going on, but the cursor was still blinking, so it didn't look fully frozen.
I have now started a CPU stress test and will report back on this later.
Update 20:17
I have executed several CPU stress tests, and they all passed.
So I'm still not sure what the exact cause is. I have reset the BIOS to its factory settings.
I have reinstalled VMware ESXi on the USB drive, but this time version 8.
I have not connected any VM yet; let's see whether this one fails as well.
You only started having issues after moving to another house? Maybe it is an electrical problem, such as inconsistent or low voltage to the server, or power fluctuations?
--Alan--
I have tried a couple of things over the last few days to find the root cause, but to be honest I'm not sure I have found one yet.
After chatting with some colleagues, they all think it is memory related. Because the memory tests didn't fail, I first cleaned the entire board with compressed air, then removed the memory modules and reseated them.
I booted ESXi 8 yesterday around 13:00 and now, at 08:00 in the morning, it is still responding. So let's see.
Cleaning the board and the memory with compressed air seemed to solve the issue, because the system was stable for about 3 days. But today it froze again, so I decided to remove one of the two memory modules to see whether one of them is the cause. It is still really strange that testing the memory didn't reveal anything.
very interesting
Earlier today the system froze again, with one memory module installed.
So this evening I removed that module and put the other module into the second memory slot.
Besides this, I disconnected the fan from the CPU, cleaned it, and applied fresh thermal paste.
It then ran fine for about 2 hours, but the system has already frozen again.
After shutting down the system and immediately checking the CPU temperature in the BIOS, it was only 50 degrees.
So I have no idea what is causing my system to freeze so frequently, without my being able to reproduce it with tests.
I found on this page that it could be related to the ASRock DeskMini itself:
Deskmini 110 power supply replacement? : ASRock (reddit.com)
They asked me:
I'm running throttled with this config and it has remained stable ever since.
But I'm not sure how to do this within ESXi. Do you perhaps know how?
I'm having exactly the same problem here, and it is very annoying. VMware ESXi, 6.7.0, 13644319. Every now and then the system just freezes. I have also run some stress tests on the system. I'm running NFS on a separate bare-metal server, and I suspect something there is causing the trouble. Does anyone have any idea about this?
I am having the same issue. It freezes intermittently, seemingly every couple of days. However, it could also be whenever I am doing a lot of VM creation. A restart of the ESXi host fixes the problem temporarily.
Did you have any luck figuring out the issue?