VMware Cloud Community
Evgenus
Contributor

ESXi PSOD and random freezes

Hi folks,

I am new to ESXi, and my English is far from good, but here is the problem.

I have 3 hosts running ESXi, and the one I thought had the best hardware consistently freezes from time to time. Once it even went to a PSOD.

Its hardware is a Supermicro SYS-6018R-WTR (BIOS version 2.0b): 2 x Xeon E5-2620 v4, 64 GB DDR4 RAM, LSI 9260-4i RAID with battery, and 4 Toshiba server disks.

Look at the picture below with the error message:

esxi_error.png

Is it something related to the processor?

Thank god we are on holidays until the 9th of January, but can somebody please enlighten me on what can be done to make this host stable?

Thank you!

31 Replies
daphnissov
Immortal

These types of machine check exception (MCE) errors often indicate a hardware failure, so you should check with Supermicro on diagnosing that. That said, you are behind on patches for 6.5, so you should plan to install P2, which came out a couple of weeks ago, and then observe the behavior.

SmokinJoe59
Enthusiast

Run Memtest86.

Is your RAM ECC, and do you have the ECC setting turned on in the BIOS? Name-brand RAM? RAM purchased as a set?

Any goofy PCIe cards?

Evgenus
Contributor

Hi,

Running MEMTEST. Will post a screenshot when it's done.

Yes, the RAM is ECC: KVR21R15D4/16 https://www.kingston.com/datasheets/KVR21R15D4_16.pdf

No, the RAM was purchased as 4 single DIMMs.

The strange thing is that you can boot the server and it shows no problems during boot.

One time it reports that the problematic DIMM is in slot 1, another time in slot 2. The error is inconsistent.

The only PCIe card in the server is the LSI 9260-4i RAID controller, and the disk subsystem looks good.

There was an additional 4-port Intel i350 RJ-45 PCI card, but I removed it.

Evgenus
Contributor

There are no specific ECC settings in the BIOS.

SmokinJoe59
Enthusiast

I cross-flashed a Dell PERC H200 with retail firmware, tried to use it in an HP Z800 workstation with 1 CPU, and had all kinds of weird memory errors in Windows 7 x64.

So if you remove all the cards and your memtest finishes cleanly, but it fails with the cards in that one box, it could be a bad card.

Note the speed of your memtest and see whether the other machines like it run at the same speed.

It could be that 1 stick of RAM is incompatible. Are all the DIMMs the same part number and voltage?

Sounds like you have some flaky hardware.

Evgenus
Contributor

Hi,

Memtest is still running: 6 hours 33 min so far.

60 out of 64 GB tested and no errors so far.

Yes all DIMMs are same part number and voltage.

Look at the screenshot below

memtest.png

Hm, you got me interested. Before I used this LSI 9260-4i, I tried to use a PERC H310 on this server.

No matter what I did, ESXi was unable to detect any volumes, even with a custom ISO or with the Dell ESXi ISO.

SmokinJoe59
Enthusiast

Can you run memtest on another box that is identical? 6.5 hours for 64 GB is a long time for a DDR3/DDR4 system. There might be a different version of Memtest that will report your chipset correctly.

Evgenus
Contributor

This one is the latest.

I will not be able to test on another machine until the 5th of January.

It's holidays here from the 1st to the 9th.

I need to make it work somehow by the 10th. =/

This is the latest version of memtest: Memtest86+ - Advanced Memory Diagnostic Tool

You can suggest any other.

I can run it via IPMI with a bootable ISO.

SmokinJoe59
Enthusiast

There are 3 versions of Memtest:

1) commercial

2) the one you have (v5)

3) another one, I think v4

They all detect hardware differently.

Anyway, I would take half your RAM out and test with just that half, noting the speed and time, then test the other half, then test with all the RAM like you are doing now.

6.5 hours for a full pass, or even 1.25 passes, is too long.
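
To put rough numbers on that comparison, here is a minimal Python sketch. It only turns "GB tested" and "elapsed time" into a GB-per-hour figure and flags runs that are much slower than the fastest one. The function names, the 20% tolerance, and the two half-set timings are made-up placeholders for illustration; only the 60 GB in 6 h 33 min figure comes from this thread. Memtest speed depends on the tests selected and the platform, so treat the result as a hint, not a diagnosis.

```python
def coverage_rate_gb_per_hour(tested_gb, hours, minutes):
    """Return how many GB memtest covered per hour of wall-clock time."""
    elapsed_hours = hours + minutes / 60
    if elapsed_hours <= 0:
        raise ValueError("elapsed time must be positive")
    return tested_gb / elapsed_hours


def flag_slow_runs(rates, tolerance=0.2):
    """Return names of runs slower than the fastest run by more than `tolerance`.

    `tolerance` is an arbitrary illustrative threshold (20% by default).
    """
    fastest = max(rates.values())
    return sorted(
        name for name, rate in rates.items() if rate < fastest * (1 - tolerance)
    )


# Figure reported in this thread: 60 GB tested after 6 h 33 min.
full_set = coverage_rate_gb_per_hour(60, 6, 33)
print(f"all DIMMs: {full_set:.1f} GB/h")

rates = {
    "all DIMMs": full_set,
    # Placeholder timings, NOT measurements: replace with your own half-set runs.
    "half A": coverage_rate_gb_per_hour(32, 1, 30),
    "half B": coverage_rate_gb_per_hour(32, 1, 40),
}
print("noticeably slower than the fastest run:", flag_slow_runs(rates))
```

If one half of the DIMMs comes out much slower than the other, or the full set is much slower than either half, that is a reason to look closer at that configuration or to compare against an identical box.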

SmokinJoe59
Enthusiast

I too am doing some shutdown work. I have issues with VDP, so I was hoping someone here would see that and offer some suggestions. LOL

Evgenus
Contributor

I can feel your pain bro. LOL

Gary_Williams
Enthusiast

Is this hardware on the HCI?

Evgenus
Contributor

Seems like hardware.

But it is also possible that I am a dumbass.

Exactly the same ESXi install on an old HP Gen8 works perfectly.

I hate Supermicro servers so much.

Evgenus
Contributor

Or do you mean memtest HCI?

Gary_Williams
Enthusiast

HCL, sorry. Too much New Year's spirit!

The hardware compatibility list. You may also need to update the firmware on your network cards and the like. I had a Dell server PSOD on me due to a Fibre Channel firmware bug.

SmokinJoe59
Enthusiast

The HP customized ISO should not be used on a Supermicro system. On the Supermicro boxes you should use the VMware one or a special one from Supermicro. Have you placed a ticket with Supermicro?

SmokinJoe59
Enthusiast

QLogic? Old firmware on the HBA, or old firmware on the Dell server? Was this a new 32/64/128 Gb Fibre Channel card?

Evgenus
Contributor

I used the HP ISO on the HP server.

I didn't use it on the Supermicro server, bro.

Yes, I placed a ticket with Supermicro at the same time I created this thread. No response from them.

There is no special Supermicro ISO of ESXi, so I just used the standard one from the VMware site.

Gary_Williams
Enthusiast

QLogic 16 Gbit. It was the firmware on the QLogic card itself that triggered the VMware PSOD.
