VMware Cloud Community
jacky86
Enthusiast

ESXi 6.0 Build 3620759 - hangs frequently

Hi All,

The server runs ESXi 6.0 build 3620759, with the specs below:

Supermicro X9DRL-3F/iF

CPU: 2 x Intel Xeon E5-2620 2.00 GHz

Memory: 4 x 16GB

HDD: 2 TB (RAID 10) [ServeRAID M5110]

Everything worked well for 8 months after installation. Recently it hangs frequently: about once every 2 days, and sometimes twice a day. I have to force a hard reboot of the physical server because it does not respond at the console.

I've checked the hardware, and everything appears to be in a normal state.

Can anyone suggest how to figure out the issue?

Thanks.


Accepted Solutions
jacky86
Enthusiast

I'm glad to say that the server issue has been resolved. The server has run for more than 12 days without hanging, so it was indeed the memory at DIMM-D1 that was faulty.

19 Replies
admin
Immortal

You are running an old ESXi build: ESXi 6.0 Update 2, released 2016-03-16, build 3620759.

The latest build is ESXi 6.0 Patch 6, released 2017-11-09, build 6921384.

You can find all build versions at the link below.

VMware Knowledge Base

Before updating, please check the VMware hardware compatibility list.

VMware Compatibility Guide - System Search 
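
The comparison above can be sketched in a few lines of Python. This is only an illustration: the two build numbers are the ones quoted in this post, while the `RELEASES` table and the `is_behind` and `newer_releases` helpers are hypothetical names, not part of any VMware tool. It also assumes that, within ESXi 6.0, a larger build number means a later release.

```python
# Build numbers quoted in this thread (release name -> build number).
RELEASES = {
    "ESXi 6.0 Update 2": 3620759,  # released 2016-03-16
    "ESXi 6.0 Patch 6": 6921384,   # released 2017-11-09
}


def is_behind(installed_build, releases=RELEASES):
    """Return True if the table lists a build newer than installed_build.

    Assumption: a larger build number means a later release.
    """
    return installed_build < max(releases.values())


def newer_releases(installed_build, releases=RELEASES):
    """List the release names whose build number is above installed_build."""
    newer = [(build, name) for name, build in releases.items()
             if build > installed_build]
    return [name for build, name in sorted(newer)]


if __name__ == "__main__":
    installed = 3620759  # the build reported by the original poster
    print(is_behind(installed))
    print(newer_releases(installed))
```

The table would need to be filled in from the build list linked above before it is useful for any other host.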

jacky86
Enthusiast

Thank you for the information.

It is so weird, because it had been working well for a long time.

deeztech69
Contributor

I seem to have a similar issue. My system hasn't been running for long but it keeps locking up and acting very unstable. I have seen some purple ESX screens and most of the time it just locks up and the console is unresponsive where I have to hard-reset it. I have the exact same motherboard/cpu combo and vmware build. Did you ever get it resolved by simply installing the latest esxi patch?

Thanks

jacky86
Enthusiast

After updating to the latest version, it still hangs. I suspect that the issue is hardware related. Maybe the hard drive is having problems. I'll keep checking.

deeztech69
Contributor

Thanks for replying. Can you clarify what you mean by the hard drive having problems? I believe the issue is motherboard related. Every time it happens I get a flood of correctable ECC errors in the IPMI log. Do you see the same thing?

jacky86
Enthusiast

I just see this event in the IPMI log: Correctable Memory ECC - Asserted. I suspect the HDD cannot handle so many VMs; I think the load exceeds the IOPS the HDD can deliver. Currently the host runs 8 guest machines (7 Windows 2008 and 1 Linux). Can the specs mentioned above handle those VMs? By the way, I've just moved some VMs to another host to reduce the load, and I'll keep monitoring.

deeztech69
Contributor

Those are memory errors, usually indicating an issue with the RAM or the DIMM slot. Can you tell me which DIMM slot is reported as having the ECC errors in the IPMI logs? My logs look just like this:

2018/02/28 10:28:32    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
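
For anyone with a long event log, here is a minimal Python sketch that tallies entries like the one above per DIMM. The regular expression is only an assumption based on this single log line, and `count_ecc_events` is a hypothetical helper; other BMC firmware may word its events differently, in which case the pattern needs adjusting.

```python
import re
from collections import Counter

# Matches lines shaped like the one quoted above. This pattern is an
# assumption derived from that single example, not a documented format.
ECC_LINE = re.compile(
    r"Correctable Memory ECC @ (?P<dimm>DIMM[A-Z]\d)\((?P<cpu>CPU\d)\) - Asserted"
)


def count_ecc_events(lines):
    """Count correctable ECC events per (cpu, dimm) pair."""
    counts = Counter()
    for line in lines:
        match = ECC_LINE.search(line)
        if match:
            counts[(match.group("cpu"), match.group("dimm"))] += 1
    return counts


sample = [
    "2018/02/28 10:28:32    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted",
    # An entry that names no slot is skipped rather than guessed at.
    "Correctable Memory ECC - Asserted",
]

if __name__ == "__main__":
    for (cpu, dimm), hits in count_ecc_events(sample).items():
        print(cpu, dimm, hits)
```

If one slot dominates the tally, that slot and the module in it are the first things to look at.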

Thanks

jacky86
Enthusiast

Here are my IPMI logs. They do not show me any DIMM.

Memory ECC error.png

See the image below for more information: the sensor readings cannot read the memory temperature. This has been the case for a long time. Could it be related to the ESXi hangs?

sesorreading.png

deeztech69
Contributor

So according to that, your motherboard doesn't show any memory slots filled. Lol. Do you have the latest BIOS, version 3.2, installed on your motherboard?

deeztech69
Contributor

This is what it's supposed to look like depending on which dimm slots you have filled:

2018_03_02_05_25_15.png

jacky86
Enthusiast

OK, I'll try to update the BIOS. By the way, I see a lot of events like this in the BIOS event log:

Smbios 0x01 P1_DIMMD1 - Description: Single Bit ECC Memory Error

Does that mean the memory at DIMMD1 is faulty?

deeztech69
Contributor

It certainly is. The P1-DIMMD1 slot can be found by looking at the m/b diagram here: https://www.supermicro.com/manuals/motherboard/C606_602/MNL-1298.pdf
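
To make the label easier to match against the diagram, here is a small Python sketch that splits it into its parts. It assumes the label reads as processor number, then memory channel letter, then slot number on that channel; that is my reading of the labelling and should be confirmed against the manual. `decode_slot` is a hypothetical helper, and it accepts both the `P1_DIMMD1` form from the BIOS log and the `P1-DIMMD1` form.

```python
import re

# Assumed label layout: P<cpu>, a '-' or '_' separator, DIMM<channel><slot>.
LABEL = re.compile(r"P(?P<cpu>\d)[-_]DIMM(?P<channel>[A-Z])(?P<slot>\d)")


def decode_slot(label):
    """Split a slot label such as 'P1_DIMMD1' into cpu, channel and slot."""
    match = LABEL.fullmatch(label.strip())
    if match is None:
        raise ValueError("unrecognised slot label: %r" % label)
    return {
        "cpu": int(match.group("cpu")),
        "channel": match.group("channel"),
        "slot": int(match.group("slot")),
    }


if __name__ == "__main__":
    print(decode_slot("P1_DIMMD1"))
```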

The fastest and simplest way to update your firmware is by using the Supermicro Update Manager here: Supermicro Update Manager (SUM) | Supermicro Server Management Utilities | Products - Super Micro Co...

Thanks and good luck.

jacky86
Enthusiast

I've taken the memory out of P1-DIMMD1 and will keep monitoring the server status. Thanks for the guide.

How about your server problem? Is it solved?

deeztech69
Contributor

I moved the memory around the slots to see if the problem follows the module or stays with the same slot. So, I'm monitoring to see if it locks up again. Last time it ran for 10 days with no problem and then it locked up on the 11th day. So, it's hard to tell yet. I'll let you know what I find. I have a friend with the exact same m/b and CPU, and he has the same problem with the same DIMM slot as me. So, we are trying to figure out if this is a manufacturer defect and if so, we might be looking into getting another m/b.
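
The reasoning behind the swap test can be written down as a short Python sketch. `classify_swap_test` is a hypothetical helper, and the slot name `DIMMA1` in the usage lines is made up for illustration; only `DIMMF1` comes from the logs in this thread.

```python
def classify_swap_test(original_slot, new_slot_of_module, slot_with_errors_after):
    """Interpret where ECC errors show up after moving a suspect module.

    original_slot:          slot that reported errors before the swap
    new_slot_of_module:     slot the suspect module was moved to
    slot_with_errors_after: slot reporting errors after the swap,
                            or None if no errors have appeared yet
    """
    if slot_with_errors_after is None:
        return "inconclusive: no errors yet, keep monitoring"
    if original_slot == new_slot_of_module:
        return "inconclusive: the module was not actually moved"
    if slot_with_errors_after == new_slot_of_module:
        return "module: the errors followed the DIMM"
    if slot_with_errors_after == original_slot:
        return "slot: the errors stayed with the slot"
    return "inconclusive: errors on an unrelated slot"


if __name__ == "__main__":
    # DIMMA1 is a made-up destination slot for this example.
    print(classify_swap_test("DIMMF1", "DIMMA1", None))
    print(classify_swap_test("DIMMF1", "DIMMA1", "DIMMF1"))
```

Because the lockups can be more than 10 days apart, a "no errors yet" result says very little until the host has run well past that interval.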

jacky86
Enthusiast

Thank you for sharing your experience. Since I took out the one memory module in slot P1-DIMMD1, the server has run for about 3 days with no problem. I hope it keeps running normally for a month.

jacky86
Enthusiast

I remember that when I moved all the memory to different slots, the server started, ran for exactly 7 minutes, and went down again :D. I had to put the modules back in their old slots and take one module out; then it started and has run normally for about 3 days without hanging :D.

jacky86
Enthusiast

Did you check the event log in the BIOS? Did you try taking out the memory located at DIMMF1?

deeztech69
Contributor

I have no event logs so far. I cleared them out from the last lockup so I can have a clear log. I am on day 3 with no lockups. I didn't take the memory out of DIMMF1, but I replaced with a different module to see if the problem persists even with a different module. I'm trying to figure out if it's a slot issue or a memory issue.
