VMware Cloud Community
Josvds
Contributor

ESXi suddenly starts freezing several times a week

I have been running a small home computer as an ESXi server for about 4 years.

It is an ASRock DeskMini with an i3 processor, 32 GB of RAM and an M.2 SSD.

Because I previously lost power suddenly in my house and then risked having to install VMware again, I added a SanDisk USB drive of about 64 GB. VMware ESXi is installed on this USB drive, and the SSD stores all the data of the virtual machines.

Josvds_0-1665727017581.png

But since I moved to another house, the ESXi server has been behaving strangely: at some point it stops responding, and it no longer reacts to a physical keyboard either. The screen still shows that I can press F2, but pressing it does nothing.

I then have to hold the power button to turn the system off and restart it. Until yesterday I was using an older version of ESXi (build 16324942), so I decided to try a newer version. The upgrade gave me an error, so I chose a fresh install of the new version on the same USB drive.

But within an hour, without connecting any VM again, it crashed again, and I could not get it to respond without holding the power button. Afterwards I was able to connect the servers again, but this morning it had stopped responding once more.

I turned on the device around 07:17 and the latest logs are from 02:43 this morning. 

Josvds_1-1665727550179.png

Does anyone have an idea what I could do to fix this?

The attachment contains all logs combined from 02:00 until 07:17.
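
When a host hangs and is later power-cycled, the point where a combined log goes silent is usually the first thing to look for. The sketch below is one possible way to find the longest silence in a list of log lines. It assumes every relevant line starts with a timestamp in the same form as ESXi's vmkernel.log uses; the function name `largest_gap` and the example lines are made up for illustration and are not taken from the attached logs.

```python
import re
from datetime import datetime

# Assumed line prefix: a UTC timestamp such as 2021-11-26T20:26:22.652Z
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6})Z")

def largest_gap(lines):
    """Return (gap, last_before, first_after) for the longest silence, or None."""
    stamps = []
    for line in lines:
        m = TIMESTAMP.match(line)
        if m:  # lines without a leading timestamp are skipped
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f"))
    stamps.sort()
    best = None
    for before, after in zip(stamps, stamps[1:]):
        if best is None or after - before > best[0]:
            best = (after - before, before, after)
    return best

# Made-up placeholder lines, not real log content.
EXAMPLE_LINES = [
    "2000-01-01T02:00:00.000Z placeholder message",
    "2000-01-01T02:43:00.000Z placeholder message",
    "2000-01-01T07:17:00.000Z placeholder message",
]

if __name__ == "__main__":
    gap, before, after = largest_gap(EXAMPLE_LINES)
    print(f"longest silence: {gap}, from {before} to {after}")
```

The lines just before the reported gap are the ones worth reading closely, since they are the last thing the host managed to write.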

 

13 Replies
maksym007
Expert

I had a similar issue a year ago in a production environment. The VMs were responsive, but I was not able to migrate them or back them up.

If you are able to test on that host, run the following command via SSH: /etc/init.d/sfcbd-watchdog stop

On the ESXi host, stop the CIM provider service. I also updated iLO and the BIOS to the newest versions, along with all drivers and firmware.

I know you want a concrete solution, but all these steps together helped me solve that issue.

Josvds
Contributor

Thanks for your help already. 

But in my case the VMs are not responding either; that is how I noticed that the main ESXi server was having issues.

This morning it was the same issue again: the system was still running, but the containers were unavailable and the ESXi host no longer responded, not even to direct access via keyboard and display on the device. The display still shows the grayed-out screen, but keyboard input no longer works.

I rebooted the device and executed the command you mentioned:

/etc/init.d/sfcbd-watchdog stop

Besides this, I found the article "Issuing a 0x85 SCSI command from a VMware ESXi 6.0 host with the EMC XtremIO storage array may resul...", because I saw this error in the log as well: "cpu3:524635)NMP: nmp_ResetDeviceLogThrottling:3782: last error status from device mpx.vmhba32:C0:T0:L0 repeated 1 times". So I also executed:

/etc/init.d/smartd stop
chkconfig smartd off

Because I'm running ESXi from a USB drive, I doubt that this drive is still fully working. Perhaps it has a problem that is causing ESXi to crash. When the system freezes again, I will install ESXi on another USB drive to check whether that fixes the problem, because I see these errors in "vmkernel.log":

2021-11-26T20:26:22.652Z cpu1:524312)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x28 (0x453a411d66c0, 526185) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed:
2021-11-26T20:26:22.652Z cpu1:524312)NMP: nmp_ThrottleLogForDevice:3865: H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. Act:NONE. cmdId.initiator=0x43034bcaccc0 CmdSN 0xc
2021-11-26T20:26:22.652Z cpu1:524312)ScsiDeviceIO: 4062: Cmd(0x453a411d66c0) 0x28, CmdSN 0xc from world 526185 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.
2021-11-26T20:26:22.652Z cpu2:526651)Vol3: 1982: Couldn't read volume header from 60749f38-e295605f-dd74-78f29e90b189: I/O error
2021-11-26T20:26:22.653Z cpu2:526651)Vol3: 4226: Failed to get object 28 type 1 uuid 60749f39-70241bc6-d9cc-78f29e90b189 FD 0 gen 0 :I/O error
2021-11-26T20:26:22.653Z cpu2:526651)WARNING: Fil3: 1518: Failed to reserve volume f533 28 1 60749f39 70241bc6 f278d9cc 89b1909e 0 0 0 0 0 0 0
2021-11-26T20:26:22.653Z cpu2:526651)Vol3: 4226: Failed to get object 28 type 2 uuid 60749f39-70241bc6-d9cc-78f29e90b189 FD 4 gen 1 :I/O error
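
For anyone reading lines like these: the H:/D:/P: fields are the host, device and plugin status, and the three bytes after "Valid sense data" are the SCSI sense key, additional sense code (ASC) and qualifier (ASCQ). Here the command 0x28 is a READ(10), D:0x2 is CHECK CONDITION, and 0x3 0x11 0x0 decodes to MEDIUM ERROR / unrecovered read error, which fits the suspicion that the USB drive itself is failing. A minimal sketch of that decoding follows. The lookup tables only cover the codes that appear above and are deliberately incomplete, and the name `decode_failed_cmd` is made up for illustration.

```python
import re

# Deliberately incomplete lookup tables: only the codes seen in the log above.
OPCODES = {0x28: "READ(10)"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}

# Matches the ScsiDeviceIO form: Cmd(...) <opcode>, ... failed H:.. D:.. P:.. Valid sense data: ..
FAILED_CMD = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*? to dev "(?P<dev>[^"]+)" failed '
    r"H:(?P<host>0x[0-9a-f]+) D:(?P<device>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+) "
    r"Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)"
)

def decode_failed_cmd(line):
    """Return a readable summary of a failed SCSI command line, or None."""
    m = FAILED_CMD.search(line)
    if m is None:
        return None
    op, dev_status = int(m["op"], 16), int(m["device"], 16)
    key, asc, ascq = (int(m[name], 16) for name in ("key", "asc", "ascq"))
    return {
        "device": m["dev"],
        # Unknown codes fall back to their hex value instead of a guess.
        "command": OPCODES.get(op, hex(op)),
        "device_status": DEVICE_STATUS.get(dev_status, hex(dev_status)),
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:#x}/{ascq:#x}"),
    }

# The ScsiDeviceIO line from the log above.
SAMPLE = (
    "2021-11-26T20:26:22.652Z cpu1:524312)ScsiDeviceIO: 4062: Cmd(0x453a411d66c0) 0x28, "
    'CmdSN 0xc from world 526185 to dev "mpx.vmhba32:C0:T0:L0" failed '
    "H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0."
)

if __name__ == "__main__":
    print(decode_failed_cmd(SAMPLE))
```

The Vol3 and Fil3 "I/O error" lines that follow in the log are what one would expect after such a read failure, since the volume header on that device can no longer be read.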

 I used the command below to check which drives have which names:

esxcli storage core path list

As far as I can see, the errors are related to the USB drive:

usb.vmhba32-usb.0:0-mpx.vmhba32:C0:T0:L0
UID: usb.vmhba32-usb.0:0-mpx.vmhba32:C0:T0:L0
Runtime Name: vmhba32:C0:T0:L0
Device: mpx.vmhba32:C0:T0:L0
Device Display Name: Local USB Direct-Access (mpx.vmhba32:C0:T0:L0)
Adapter: vmhba32
Channel: 0
Target: 0
LUN: 0
Plugin: NMP
State: active
Transport: usb
Adapter Identifier: usb.vmhba32
Target Identifier: usb.0:0
Adapter Transport Details: Unavailable or path is unclaimed
Target Transport Details: Unavailable or path is unclaimed
Maximum IO Size: 32768

pcie.100-pcie.0:0-eui.0000000001000000e4d25c5331cb5201
UID: pcie.100-pcie.0:0-eui.0000000001000000e4d25c5331cb5201
Runtime Name: vmhba1:C0:T0:L0
Device: eui.0000000001000000e4d25c5331cb5201
Device Display Name: Local NVMe Disk (eui.0000000001000000e4d25c5331cb5201)
Adapter: vmhba1
Channel: 0
Target: 0
LUN: 0
Plugin: HPP
State: active
Transport: pcie
Adapter Identifier: pcie.100
Target Identifier: pcie.0:0
Adapter Transport Details: Unavailable or path is unclaimed
Target Transport Details: Unavailable or path is unclaimed
Maximum IO Size: 131072
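
To match a device name from the error messages to its transport without reading the listing by eye, the output can be split into one record per path. This is only a rough sketch: it assumes the listing was saved as text in the layout shown above (a path name line followed by "Key: Value" lines, with blank lines between paths). The function names and the "Path" key are made up for illustration.

```python
import re

def parse_path_list(text):
    """Turn saved `esxcli storage core path list` output into one dict per path."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        record = {"Path": lines[0]}  # first line of each block is the path name
        for ln in lines[1:]:
            key, sep, value = ln.partition(": ")
            if sep:
                record[key] = value
        records.append(record)
    return records

def devices_with_transport(records, transport):
    """Return the device names of all paths that use the given transport."""
    return [r["Device"] for r in records if r.get("Transport") == transport]

# Abbreviated copy of the output above (only a few fields kept).
SAMPLE = """\
usb.vmhba32-usb.0:0-mpx.vmhba32:C0:T0:L0
Runtime Name: vmhba32:C0:T0:L0
Device: mpx.vmhba32:C0:T0:L0
Transport: usb

pcie.100-pcie.0:0-eui.0000000001000000e4d25c5331cb5201
Runtime Name: vmhba1:C0:T0:L0
Device: eui.0000000001000000e4d25c5331cb5201
Transport: pcie
"""

if __name__ == "__main__":
    print(devices_with_transport(parse_path_list(SAMPLE), "usb"))
```

With the listing above, the only device on the usb transport is mpx.vmhba32:C0:T0:L0, the same name that appears in the failed commands in vmkernel.log.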

I will get back when I know more.

maksym007
Expert

Are you able to patch that ESXi host?

Josvds
Contributor

Disabling the services didn't help. Later yesterday it froze again. I had unregistered all VMs to be sure they weren't the cause.

I installed the latest downloadable version of VMware ESXi 7, so I'm not sure which patch you are referring to.

I will check whether installing on a different USB drive helps.

maksym007
Expert

How many ESXi hosts do you have in that cluster?

Does only one specific host have these issues? Please check and compare their settings.

Have you configured the option to send dumps via the dump collector? Maybe VMware support will be able to assist.

Josvds
Contributor

It is just one system. I'm currently running a couple of stress tests on it to see if the problem is hardware related.

Will get back to you.

Update 19:30

After running memtest for about 2-3 hours, it completed 6 iterations without any error.
The strange part was that it responded when I pressed ESC, but the device didn't shut down or exit the app, so I had to hard shut down the device again.
I'm not sure if that was an issue with memtest or if something else is going on, but the cursor was still blinking, so it didn't look fully frozen.

I have now started a CPU stress test and will get back on this later.

Update 20:17

I have executed several CPU stress tests, and they all passed.
So I'm not sure what the exact cause is. I have reset the BIOS to its factory settings.
I have reinstalled VMware ESXi on the USB drive, but this time version 8.

I have not connected any VM yet. Let's see if this one fails as well.

alantz
Enthusiast

Just moving to another house and you have issues? Maybe electrical issues like inconsistent/low voltage to the server or power fluctuations? 

--Alan--

 

Josvds
Contributor

I have tried a couple of things over the last few days to find the root cause, but to be honest I'm not sure I have found one yet.

  • I moved it from my attic to my living room to see if the problem is related to the outlet in the room
    • But it still froze, so that wasn't the solution
    • Not sure how I can check whether the power is an issue
  • I booted Hiren's Boot CD
    • I couldn't find any suitable test, but at least it didn't freeze
  • I booted Ultimate Boot CD
    • Executed a memory test for a couple of hours, but it didn't freeze
    • Executed a CPU test for a couple of hours, but it didn't freeze
  • I left the system sitting in its BIOS to see if it would stop responding there as well (tried this for about 20 hours)
    • It didn't freeze
  • I booted Lubuntu live from another USB drive and ran NGStress, Memtester and GtkStressTesting
    • The system didn't freeze

After chatting with some colleagues, they all think it's memory related. Because the memory tests didn't fail, I first cleaned the entire board with compressed air, then disconnected the memory modules and reinstalled them.

I booted ESXi 8 yesterday around 13:00 and now, at 08:00 in the morning, it is still responding. So let's see.

Josvds
Contributor

Cleaning the board and the memory with compressed air seemed to solve the issue, because the system was stable for about 3 days. But today it froze again. So I decided to remove one of the two memory modules to see whether one of them is the cause. It is still really strange that I didn't find anything when testing the memory.

maksym007
Expert

very interesting

Josvds
Contributor

Earlier today the system froze again, with one memory module installed.

So this evening I removed that memory module and put the other module into the second memory slot.
Besides this, I disconnected the fan from the CPU, cleaned it and applied fresh thermal paste.
It ran fine for about 2 hours, but now the system is frozen again.

After shutting down the system and immediately checking the CPU temperature in the BIOS, it was only 50 degrees.

So I have no idea what is causing my system to freeze so frequently, and I can't reproduce it with tests.

Josvds
Contributor

I found on this page that it could be related to the ASRock DeskMini itself:

Deskmini 110 power supply replacement? : ASRock (reddit.com)

They told me:

I'm running throttled with this config and it has remained stable ever since.

But I'm not sure how to do this within ESXi. Does anyone know how?

caio42
Contributor

I'm having exactly the same problem here, and it is very annoying. VMware ESXi, 6.7.0, 13644319. Every now and then the system just freezes. I have also run some stress tests on the system. I'm running NFS on a separate bare-metal server, and I suspect something there is causing the trouble. Does anyone have any idea about this?
