VMware Cloud Community
jst68
Contributor

ESXi 6.7 Server becomes unresponsive (no kernel dump log)

I am running a small ESXi 6.7 installation in my home lab. On 6.5 it ran without any issues, but since I upgraded to 6.7 the machine simply fails from time to time. It used to fail about once a week, but since installing the latest patch it fails daily or even more frequently, depending on use.

Now, I know that the first thing to do is to get the kernel dump, but unfortunately there isn't one. For monitoring, I have connected the machine to a vCenter backend, but it also doesn't show any issues beyond the apparent ones (VMs disconnect, host becomes unreachable, etc.)
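
For reference, this is roughly how I checked whether a dump target is configured at all (the esxcli syntax as I understand it; output will obviously differ per setup):

esxcli system coredump partition get    # shows the active dump partition, if any
esxcli system coredump file list        # lists file-based dump targets
ls /var/core                            # any userworld core dumps left behind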

I have also looked through the logs (from a support bundle), but I can't see any smoking gun.
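
For anyone wondering which logs I mean, these are the ones I went through, plus the command to pull a fresh support bundle (standard locations, as far as I know):

less /var/log/vmkernel.log    # core VMkernel messages; storage and driver errors usually land here
less /var/log/vobd.log        # VMkernel observations (hardware/connectivity events)
less /var/log/hostd.log       # host agent log
vm-support                    # generates a full support bundle; the output path is printed when it finishes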

The host machine uses the following hardware:

CPU: i7-4771 (info pulled from vCenter, but I somehow thought that the machine was using a Xeon processor)

Memory: 32 GB

SSD: Intel SSD 80 GB (ESXi); reserved 8 GB for flash

SSD: Samsung 250 GB (all running VMs); no flash reservation

HDD: WD 3 TB (unused VMs)

iSCSI: Connected to Synology DS1813+

Network 1: Realtek (1 Gbit) - Management Network / 2 VMs

Network 2: Intel (1 Gbit) - 1 VM

Network 3: Intel (1 Gbit) - iSCSI

I initially thought that it might be some kind of hardware issue, but a recent SMART test on the drives showed no problems.

It is puzzling to me that the system seems to fail based on workload, even though CPU usage never exceeds 60% and memory usage never goes beyond 50%. What does seem to correlate is network workload: the more network traffic, the sooner it fails. On average, there are only 2-3 VMs running.

Once the problem occurs, I can no longer access the machine over any of the network cards. Even ping stops working. All VMs basically freeze up. On reboot, everything goes back to normal.
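
In case it helps, this is how I checked which driver each NIC is bound to and whether anything got logged around the hang (vmnic0 is just a placeholder, swap in your own NIC name):

esxcli network nic list                 # shows vmnicX, driver, link state and speed
esxcli network nic get -n vmnic0        # more detail for a single NIC
grep -i vmnic /var/log/vmkernel.log     # any driver or link messages near the time of the freeze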

If left alone (e.g. VMs are loaded but unused), it can run for extended periods (10+ days), but under load it can fail within the hour.

I have some experience using the ESXi shell and can look things up as needed. I just hope that somebody can point me in the right direction (e.g. an example of what to look for).

Any help or feedback is appreciated!

6 Replies
JCL_MDOT
Contributor

I figured it out, well, sort of.

I reset the BIOS settings to default, and then it worked. I think I had a bad IOMMU setting.

-JCL

scott28tt
VMware Employee

@JCL_MDOT 

Are you also @jst68 ??

 


-----

Although I am a VMware employee, I contribute to VMware Communities voluntarily (i.e. not in any official capacity).
VMware Training & Certification blog
JCL_MDOT
Contributor

Whoops, no, looks like I replied to the wrong post.

My bad....

-JCL

jst68
Contributor

Since the logs don't give me any helpful information, I am going through a process of elimination now.

I started by testing my iSCSI connection, and it seems to work fine even under high workload. Then I tested individual VMs and found that an increased network workload seems to make the problem occur sooner. However, I can't force the problem simply by maximizing the network load for a VM; it is always somewhat time-based. That said, a VM with little workload might last for a week (or longer), whereas a VM with a high network load will fail within less than 30 minutes.
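
For these load tests I mostly watched esxtop in an SSH session; the keys and counters below are from memory, so double-check them on your build:

esxtop
# n = network view (watch %DRPTX / %DRPRX for dropped packets)
# d = disk adapter view, u = disk device view (watch DAVG/cmd and KAVG/cmd latencies)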

Next, I suspected the Realtek network card, but when I increased the network workload for a VM connected to one of the Intel cards, the host also failed.

Now, I am pretty much out of ideas. As a last resort, I might try to connect the host to a monitor to see if there is any visible output when the issue takes place.

Any other suggestions what I could try? Thank you!

jst68
Contributor

Today, I hooked up a monitor to the ESXi server, but it didn't yield any new findings because there is no additional information displayed when the problem occurs. The system simply freezes up.

I also went over my BIOS settings and disabled the XMP profile for my memory, but that was more an effort to further increase stability. Needless to say, it didn't solve the problem.

Then I did some digging and, based on another post, found that there are two drivers for AHCI devices: one is named sata-ahci and the other vmw_ahci. My system was using vmw_ahci. So I figured things couldn't get any worse at this point and simply disabled vmw_ahci with the following SSH command:

 

esxcli system module set -e=false -m=vmw_ahci

 

Then I rebooted the system and confirmed that it is now using the other driver (sata-ahci) by checking the Storage > Adapters tab.
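
For completeness, this is how I double-checked from the shell which AHCI driver the controller ended up with, plus the command to switch back in case the legacy driver causes its own problems (my understanding of the syntax, so verify before relying on it):

esxcli storage core adapter list               # the Driver column should no longer show vmw_ahci
esxcli system module list | grep -i ahci       # shows which of the two modules is loaded/enabled
esxcli system module set -e=true -m=vmw_ahci   # revert: re-enable the native driver, then reboot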

Well, the problem hasn't come back since then. I will monitor it some more and report back if there is any change, but I have been able to crank up system load to a level where it previously would fail within minutes and it just keeps working. 😀

Hope this will help someone else running into the same problem!

jst68
Contributor

Just a quick update: The system has been running rock solid for almost two days now.

It's frustrating that drivers like this one pass testing at VMware. I get the hardware compatibility argument (even though I neither like nor agree with it), but it is safe to assume that the same problems can hit customers running this driver on certified hardware as well.

Disappointing, and a reason to slow down the upgrade plans in our data center!
