VMware Cloud Community
bleuze
Enthusiast

host random reboot (affecting Dell PE 630 servers with E5-26xx v4 CPU)

My vsphere 6.5 host just rebooted on me about 1/2 hour ago. Where can I find a log telling me why?

19 Replies
daphnissov
Immortal

/var/log/vmkernel.log

a_p_
Leadership

In case it was a hard restart, and your host has a management interface (iLO, IDRAC, ...) you may want to check the logs there.

An ESXi host doesn't usually reboot, even if it detects an issue, but rather ends up in a PSOD (Purple Screen of Diagnostics), so it may be related to a hardware issue, or even a power issue.

André

bleuze
Enthusiast

Thanks. Here is the log. Can someone help me decipher it? I am pretty sure the reboot happened right at 2018-07-26T18:57
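
If the log is large, a small script can narrow it down to the minutes around the suspected reboot. Below is a minimal Python sketch, assuming each line starts with an ISO 8601 timestamp in UTC (as vmkernel.log lines normally do). The function name `lines_near` and the sample lines are made up for illustration; they are not real log output.

```python
from datetime import datetime, timedelta, timezone


def lines_near(lines, when, minutes=5):
    """Return the lines whose leading timestamp is within +/- minutes of `when`."""
    window = timedelta(minutes=minutes)
    hits = []
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            # Assumed format: 2018-07-26T18:55:02.000Z (a trailing Z means UTC)
            ts = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)  # assumption: bare stamps are UTC
        if abs(ts - when) <= window:
            hits.append(line)
    return hits


# Placeholder lines, only the timestamps matter here
sample = [
    "2018-07-26T18:40:00.000Z placeholder entry well before the reboot",
    "2018-07-26T18:55:02.000Z placeholder entry close to the reboot",
    "2018-07-26T19:10:00.000Z placeholder entry after the host came back",
]
reboot = datetime(2018, 7, 26, 18, 57, tzinfo=timezone.utc)
for line in lines_near(sample, reboot):
    print(line)
```

With a real file you would pass `open("vmkernel.log")` instead of `sample`, and widen `minutes` if nothing shows up.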

bleuze
Enthusiast

Hi Andre

Thanks for the clue. Looks like it was a hard restart.

From the idrac log:

PWR2262: The Intel Management Engine has reported an internal system error.

2018-07-26T18:59:29-0500

Log Sequence Number: 447

Detailed Description:

The Intel Management Engine is unable to utilize the PECI over DMI facility.

Recommended Action:

Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.

CPU0000: Internal error has occurred check for additional logs.

2018-07-26T18:59:30-0500

Log Sequence Number: 448

Detailed Description:

System event log and OS logs may indicate the source of the error.

Recommended Action:

Review System Event Log and Operating System Logs. These logs can help the user identify the possible issue that is producing the problem.

PWR2262: The Intel Management Engine has reported an internal system error.

2018-07-26T18:59:31-0500

Log Sequence Number: 449

Detailed Description:

The Intel Management Engine is unable to utilize the PECI over DMI facility.

Recommended Action:

Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.

RAC0703: Requested system hardreset.

2018-07-26T18:59:31-0500

Log Sequence Number: 450

Detailed Description:

Requested system hardreset.

Recommended Action:

No response action is required.

Well, it may be more for a Dell forum to figure out what this means, except for the message saying "Review System Event Log and Operating System Logs". So, as for operating system logs: is the vmkernel log it, or should I post syslog.log as well?
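
The Recommended Action above boils down to one check: for each PWR2262 entry, does a PWR2264 entry follow it? Here is a rough Python sketch of that check, run against the four entries I pasted. `unrecovered_pwr2262` is a hypothetical helper name, and treating the "30 seconds" as a hard window is my own assumption, not something Dell states.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the Lifecycle Log excerpt, e.g. 2018-07-26T18:59:29-0500
FMT = "%Y-%m-%dT%H:%M:%S%z"


def unrecovered_pwr2262(entries, grace_seconds=30):
    """entries: list of (message_id, timestamp_string) tuples.

    Return the timestamps of PWR2262 entries that are not followed by a
    PWR2264 entry within grace_seconds.
    """
    parsed = [(mid, raw, datetime.strptime(raw, FMT)) for mid, raw in entries]
    grace = timedelta(seconds=grace_seconds)
    missing = []
    for mid, raw, ts in parsed:
        if mid != "PWR2262":
            continue
        recovered = any(
            other == "PWR2264" and ts <= t <= ts + grace
            for other, _, t in parsed
        )
        if not recovered:
            missing.append(raw)
    return missing


# The four entries from the excerpt above
entries = [
    ("PWR2262", "2018-07-26T18:59:29-0500"),
    ("CPU0000", "2018-07-26T18:59:30-0500"),
    ("PWR2262", "2018-07-26T18:59:31-0500"),
    ("RAC0703", "2018-07-26T18:59:31-0500"),
]
print(unrecovered_pwr2262(entries))
```

Both PWR2262 entries come back as unrecovered, since there is no PWR2264 anywhere in the excerpt.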

bleuze
Enthusiast

OK

I see references to scsi devices naa.6d0946601f70d30022186609290abb0c and naa.6d0946601f70d300228215fee13a5b8c in both the vmkernel.log and in the syslog.log shortly before the restart.

These are the 2 virtual disks created by the Dell RAID card on the host and presented to VMware as raw storage. Could it be something going on with them? iDRAC says the virtual disks are healthy. They are what the 2 VMware datastores are built on.

A small portion of the syslog is attached. The reboot happens close to the beginning, right after the 2018-07-26T18:55:02Z log entry.

Devi94
Hot Shot

Looks like you have an issue with lsi_mr3 driver. Please make sure your drivers are up to date.

Lxypi
Contributor

I took a peek at the logs, and it doesn't look like this was caused by the RAID driver.  Yes, the lsi_mr3 driver is very broken prior to May 2018, so you should upgrade if you haven't:

https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXI65-AVAGO-LSI-MR3-77050900-1OEM&produ...

I just had a Dell R630 crash with this same symptom, and this entry in the Lifecycle Log:

PWR2262: The Intel Management Engine has reported an internal system error.

2018-08-01T07:08:07-0500

Log Sequence Number: 2245

Detailed Description:

The Intel Management Engine is unable to utilize the PECI over DMI facility.

Recommended Action:

Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.

I see that Dell has released an urgent BIOS/microcode package last week:

Dell Server PowerEdge BIOS R630/R730/R730XD Version 2.8.0 | Dell US https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverId=2JFRF

I'm going to get upgraded tonight.  I would suggest getting this update on your Dell 13th gen hosts, as it appears to be a broken microcode issue.

bleuze
Enthusiast

Thanks Lxypi

I wasn't so sure about the lsi_mr3 driver being the problem. But to cover all the bases I decided to upgrade. I imagined that by updating to ESXi 6.7 using the latest Dell OEM ISO (released 2018-07-16), I would get the most up-to-date drivers. And indeed my lsi_mr3 driver went from version 7.700.50.00-1OEM.650.0.0.4598673 to 7.703.18.00-1OEM.650.0.0.4598673. But that is still not as up-to-date as the 7.705.09.00-1OEM.650.0.0.4598673 that you gave us a link to. So I will patch my 2 hosts with this newer version too.

As for the BIOS: my host is also a Dell 630 (ours is a T630), and this weekend I also checked for BIOS updates and went from 2.6.0 to 2.8.0. I hope that solves the issue. We have had this server in production since about January this year (2018) and this is the first time it has done this random reboot.
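
These version strings are easy to misread by eye, so here is a small Python sketch for ordering them. It assumes the part before the first "-" is a dotted numeric version and ignores the OEM/build suffix; `driver_version_key` is just a name I picked.

```python
def driver_version_key(version):
    """Turn '7.703.18.00-1OEM.650.0.0.4598673' into (7, 703, 18, 0)."""
    numeric = version.split("-", 1)[0]  # drop the '-1OEM...' suffix
    return tuple(int(part) for part in numeric.split("."))


versions = [
    "7.703.18.00-1OEM.650.0.0.4598673",  # after the Dell OEM ISO upgrade
    "7.700.50.00-1OEM.650.0.0.4598673",  # what the host had before
    "7.705.09.00-1OEM.650.0.0.4598673",  # the version from the link above
]
ordered = sorted(versions, key=driver_version_key)
print("newest:", ordered[-1])
```

Comparing tuples of integers avoids the trap of comparing the strings directly, which breaks as soon as one field has a different number of digits.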

I think I'll contact Dell support as well to see if they know anything about that lifecycle log entry that we both had. I'll update here if I find out anything.

bleuze
Enthusiast

I have some info from Dell Support. Following is an excerpt of the transcript of my chat session (they recommend changing the default bios settings):

DELL:

In the system profile settings of the BIOS, it's currently set to "PerfPerWattOptimizedDapc"

10:12:53 AM

We'll need to set that to custom, and then adjust the C1E and C-states to disabled.

10:13:39 AM

Sometimes what we've seen happen with those C-states enabled is in situations where the OS on the server is a non-Windows OS and there are periods of low CPU utilization, we'll see these errors and resets. Engineering is aware of the issue occurring and has been looking into root causes. They will update our hardware team if a root cause is determined, but in the meantime, they are asking that when we encounter this we make sure the CPU power states (C-states) are set to disabled.

10:17:19 AM

ME:

great, thanks, is that the only suggested bios settings?

10:18:27 AM

DELL:

Yes those 2 C-state settings. The C1E and C-states, listed under system profile once the system profile has been set to "custom"

10:19:53 AM

ME:

We also have a PE R730 server. does engineering suggest we do the same bios settings on it?

10:20:52 AM

DELL:

It depends on the CPU. If it also has the E5-26xx v4 CPU, and has experienced the Intel Management Engine error in the system logs, then yes. Otherwise, we generally recommend keeping C-states enabled as that can change the overall power consumption of the server.

10:24:47 AM

The c-states essentially allow the CPU to not draw as much power during idle times. In most situations it works without issue. When C-states are disabled, the CPU will remain on and fully powered all the time.

10:25:29 AM
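
Put as a rule, the advice in this chat is: disable C1E and C-states only if the host has an E5-26xx v4 CPU and has logged the Intel Management Engine error (PWR2262); otherwise leave them enabled. Here is a small Python sketch of that rule. The function name, the regular expression and the shape of the inputs are my own assumptions, not anything Dell provided.

```python
import re

# Matches model strings such as "E5-2699 v4"; the pattern is an assumption
E5_26XX_V4 = re.compile(r"E5-26\d\d\w*\s+v4\b", re.IGNORECASE)


def should_disable_cstates(cpu_model, lifecycle_message_ids):
    """Apply the rule of thumb from the chat transcript above."""
    has_affected_cpu = bool(E5_26XX_V4.search(cpu_model))
    saw_me_error = "PWR2262" in lifecycle_message_ids
    return has_affected_cpu and saw_me_error


# Affected CPU family and the error was logged: disable C1E and C-states
print(should_disable_cstates("E5-2699 v4 @ 2.20GHz", ["PWR2262", "RAC0703"]))
# Affected CPU family but no error logged: keep C-states enabled
print(should_disable_cstates("E5-2699 v4 @ 2.20GHz", []))
```

This only encodes the decision; the settings themselves still have to be changed in the BIOS under the system profile once it is set to "custom".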

PepsiGuy2
Contributor

I just ran across this thread.  I'm experiencing the same problem (for 6 months or so now).  Sporadic reboots of ESXi 6.5.   I've updated to the latest firmware which includes BIOS 2.8.0 as well as newest ESXi 6.5 but getting the same random reboots.   Looking through the Lifecycle Logs in iDRAC I came across the "PWR2262" error.  

I just changed my C1E and C-states to disabled, as you said Dell instructed. I'll see if that works around the problem. Did this solve the issue on your end?

bleuze
Enthusiast

Hi PepsiGuy2

I have not had another random reboot since I applied the C-states config in the BIOS 3 months ago. But of course, before that the server had only ever had one random reboot since it was purchased 6 months earlier. So the fact that we have not had another reboot does not prove that the issue is fixed.

PepsiGuy2
Contributor

On an idle ESXi R630 system with E5-2699 v4 @ 2.20GHz CPUs, I was getting a reset every 24 hours or so before this change. After the above change (it's only been a few days), it has not rebooted yet. The issue/workaround makes sense, as this server would only reset every few months in the past, but it was normally under load, so it likely didn't put the CPUs into C-states very often. Now that it's idle, it would go into lower C-states often, hence the frequent resets. Only time will tell, but it's looking like a good workaround.

JohnDSW
Contributor

We have R730s with E5-2699 v4, BIOS 2.8.0. Ours are suffering from cascade failures. First the servers crashed and we got messages about the CPU failing in the Lifecycle Log. Then everything (HBA, NIC, etc.) reported a failure of some kind, and then the iDRAC would become inaccessible. We had to escalate up to Dell engineering before they gave us a response: disable QPI Link L1 Power management. Apparently the setting has a microcode issue they did not disclose, and they then set it to enabled by default in the 2.8.0 BIOS. They instructed us to disable that feature.

System Settings > BIOS settings > Processor Settings > QPI Link L1 Power management

Whether or not this works is still up in the air.  I'll be disabling it on roughly 70 servers. heavy sigh

bleuze
Enthusiast

Hmm, I guess I'll be disabling that setting too next time I reboot the servers - just to be safe.

testbunny
Contributor

Hey guys, does anybody have any update on this?

We have 15 M630 servers, and 9 of them already have the same issue. VMware 6.5, Horizon 7.

We set up the servers in Jan 2017. Until mid-2018 there were no issues; since then, a lot. 2683 v4 processors. Unfortunately our initial setup already set the profile to Performance, which has the C-states disabled, so there is nothing left for me to disable. But the issue is exactly the same, and all of the blades are on the latest firmware. A bit annoying for such an expensive system.

bleuze
Enthusiast

On Jan 31 Dell released a new BIOS (version 2.9.1) for the PowerEdge T630. I installed it on the weekend. The release notes do not specifically say they address this issue, so I contacted Dell support to ask. This is their reply: "I heard back already that yes, BIOS 2.9.1 should resolve the issue that disabling C-States is a workaround for on these servers."

By the way, as of yesterday, Feb 3, this new BIOS is still not in the database that you connect to when booting into lifecycle manager and looking for firmware updates. I had to download the Linux BIOS update package from the Dell support site, copy it onto a Linux boot USB flash disk and boot from that media to install the BIOS. I had an older version of the "Dell Support Live Image" ISO (a newer version is available on the Dell support site) that I used as the boot disk. Note that it is not a clickable GUI install; find the file and run it from the command line using sudo. In my case:

$ sudo /mnt/disk/sda1/BIOS_YY63D_LN_2.9.1.BIN

It doesn't actually install the BIOS; it passes it over to the lifecycle controller, which installs it on reboot.

SamiKirves
Contributor

Sorry to bring this back up after several years. I have a Dell R630 with an E5-2699 v4, BIOS 2.14 and iDRAC 2.83.83.83.

It experienced a PWR2262 event with an immediate server reboot. No PWR2264 afterwards.

OS ESXi 6.7 19195723

I just mean that updates have not fixed this.

I must check and configure C-states.

Kinnison
Commander

Hi,


AFAIK,


If I'm not mistaken, PWR2262 is related to a crash of the "Intel Management Engine".
On another well-known forum some users reported issues with BIOS v2.14 on machines like yours (and mine).
A quick search on the manufacturer's site suggests that this BIOS version has been removed.


So, your current problem in my humble opinion (I could be wrong) is more related to an unfortunate combination of machine software and/or bios settings rather than an ESXi specific problem. Personally (a choice) my machines are set with the "maximum performance profile", ESXi is also set up to not use any power management features.


Regards,
Ferdinando

SamiKirves
Contributor

Excellent catch!

It seems that this version is no longer available from Dell support, but my OpenManage Enterprise had already downloaded it earlier...

I must learn how to downgrade BIOS, if possible.
