Hi All,
The server runs ESXi 6.0 build 3620759, with the specs below:
Supermicro X9DRL-3F/iF
CPU: 2 x Intel Xeon E5-2620 2.00GHZ
Memory: 4 x 16GB
HDD: 2TB (RAID 10) [ServerRAID M5110]
Everything worked fine for the first 8 months after installation. Recently, it has been hanging frequently, about once every 2 days, and sometimes twice a day. I have to force a hard reboot of the physical server since it doesn't respond at the console.
I've checked the hardware; everything is in a normal state.
Can anyone suggest how I can figure out the issue?
Thanks.
You have an old ESXi build: ESXi 6.0 Update 2, 2016-03-16, build 3620759.
The latest build is ESXi 6.0 Patch 6, 2017-11-09, build 6921384.
You can find all build versions at the link below.
Before updating, please check the VMware Hardware Compatibility List.
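If it helps, the running build can be checked from the ESXi shell with `vmware -v` before deciding whether to patch. A minimal sketch, where the sample string mimics that command's output (on a live host you would capture it with `current=$(vmware -v)` instead):

```shell
# Sample of what `vmware -v` prints on an ESXi 6.0 U2 host; on a live host,
# replace this line with: current=$(vmware -v)
current="VMware ESXi 6.0.0 build-3620759"

# Strip everything up to and including "build-" to get the build number.
build=${current##*build-}

# Build number of ESXi 6.0 Patch 6, from the post above.
target=6921384

if [ "$build" -lt "$target" ]; then
    echo "update needed (running build $build, latest is $target)"
fi
```

I believe `esxcli system version get` shows the same information in a more structured form, if you prefer that.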
Thank you for your information.
It's so weird, since it had been working well for a long time.
I seem to have a similar issue. My system hasn't been running for long, but it keeps locking up and acting very unstable. I have seen some purple ESXi screens, and most of the time it just locks up with an unresponsive console, so I have to hard-reset it. I have the exact same motherboard/CPU combo and VMware build. Did you ever get it resolved by simply installing the latest ESXi patch?
Thanks
After updating to the latest version, it still hangs. I suspect the issue is hardware related; maybe the hard drive is having problems. I'll keep checking.
Thanks for replying. Can you clarify what you mean by the hard drive having problems? I believe the issue is motherboard related. Every time it happens, I get a flood of correctable ECC errors in the IPMI log. Do you see the same thing?
I just see this event in the IPMI log: "Correctable Memory ECC - Asserted". I suppose the HDD cannot handle too many VMs; I think they exceed the IOPS of the HDD. Currently the host is running 8 guest machines (7 Windows 2008 and 1 Linux), and I'm wondering whether the specs mentioned above can handle those VMs or not. By the way, I've just moved some VMs to another host to reduce the load, and I'm keeping an eye on it.
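On the IOPS question, a rough back-of-the-envelope estimate may help. All of the numbers below are my assumptions, not from your post: the 2TB RAID 10 as 4 x 7.2k SATA spindles at roughly 75 IOPS each, a RAID 10 write penalty of 2, and a 70/30 read/write mix:

```shell
# All numbers here are assumptions: 4 spindles, ~75 IOPS per 7.2k SATA disk,
# RAID 10 write penalty of 2, 70% reads / 30% writes.
disks=4
per_disk=75
read_pct=70
write_pct=30

raw=$((disks * per_disk))                              # raw spindle IOPS
effective=$((raw * 100 / (read_pct + write_pct * 2)))  # penalty-adjusted
echo "~${effective} effective IOPS for the array"      # prints ~230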
Those errors are memory errors, usually indicating an issue with a RAM module or DIMM slot. Can you tell me which DIMM slot is reporting the ECC errors in the IPMI logs? My logs look like this:
2018/02/28 10:28:32 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
Thanks
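To narrow down which slot is throwing the errors, one way is to tally the SEL entries per DIMM. A sketch, assuming entries in the format shown above (the heredoc fakes a few sample lines; on a live host you would dump the log with `ipmitool sel list` instead):

```shell
# Fake SEL entries in the format from the post above; on a live host, replace
# the heredoc with: ipmitool sel list > sel.log
cat > sel.log <<'EOF'
2018/02/28 10:28:32 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
2018/02/28 11:02:10 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
2018/03/01 09:15:44 OEM Memory Correctable Memory ECC @ DIMMD1(CPU1) - Asserted
EOF

# Pull the DIMM identifier out of each ECC line and count occurrences.
counts=$(grep 'Correctable Memory ECC' sel.log \
    | grep -o 'DIMM[A-H][0-9](CPU[12])' \
    | sort | uniq -c | sort -rn)
echo "$counts"
rm -f sel.log
```

The slot at the top of the list is the one to reseat or swap first.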
My IPMI logs don't show any DIMM.
See the image below for more information: the sensor readings cannot read the memory temperature information, and this has been the case for a long time. Could it be related to the ESXi hangs?
So according to that, your motherboard doesn't show any memory slots as populated. Lol. Do you have the latest 3.2 BIOS installed on your motherboard?
This is what it's supposed to look like, depending on which DIMM slots you have filled:
OK, I'll try to update the BIOS. By the way, I see a lot of events like this in the BIOS event log:
Smbios 0x01 P1_DIMMD1 - Description: Single Bit ECC Memory Error. Does that mean the memory at DIMMD1 has an error?
It certainly does. The P1-DIMMD1 slot can be located using the motherboard diagram here: https://www.supermicro.com/manuals/motherboard/C606_602/MNL-1298.pdf
The fastest and simplest way to update your firmware is with the Supermicro Update Manager, here: Supermicro Update Manager (SUM) | Supermicro Server Management Utilities | Products - Super Micro Co...
Thanks and good luck.
I've taken the memory out of P1-DIMMD1 and am keeping an eye on the server status. Thanks for the guidance.
How is your server problem? Is it solved?
I moved the memory around the slots to see if the problem follows the module or stays with the same slot, so I'm monitoring to see if it locks up again. Last time it ran for 10 days with no problem and then locked up on the 11th day, so it's hard to tell yet. I'll let you know what I find. I have a friend with the exact same motherboard/CPU, and he has the same problem with the same DIMM slot as me. So we are trying to figure out if this is a manufacturer defect, and if so, we might look into getting another motherboard.
Thank you for sharing your experience. After taking out the memory module at slot P1-DIMMD1, the server has run for about 3 days with no problems. Hopefully it can run normally for a month.
I remember that when I moved all the memory to different slots, the server ran for exactly 7 minutes after the change and then went down again :smileygrin:. I had to revert to the old slots and take one module out, after which it started and ran normally for ~3 days without hangs :smileygrin:.
Did you check the event log in the BIOS? Did you try taking out the memory located at DIMMF1?
I have no event logs so far; I cleared them out after the last lockup so I'd have a clean log. I am on day 3 with no lockups. I didn't take the memory out of DIMMF1, but I replaced it with a different module to see if the problem persists even then. I'm trying to figure out whether it's a slot issue or a memory issue.
I'm glad to say the server issue has been resolved. The server has run for more than 12 days without hangs, so it really was the memory module at DIMM-D1 that was faulty.