VMware Cloud Community
tcornwell
Contributor

VMware vSphere ESXi Lockup

We are running VMware vSphere 4 ESXi on Sun X6450 blades with the Intel six-core Dunnington processors, boot from Compact Flash, iSCSI storage, 10GbE, and 92GB of RAM. The blades run fine for 10 to 30 days before completely locking up. There is no PSOD and no indication in any log that something has gone wrong. The machine simply stops responding. The issue is intermittent and apparently random. We can't find any correlation to the problem except the hardware; we are running X6250 blades in a similar configuration with no problem. We also see the problem when there is high load, when there is moderate load, and when there is no load. We have even seen it on blades where ESXi is in maintenance mode.
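Since nothing shows up in any log, something along these lines run from a management box would at least pin down when each blade stops responding, so we can look for a pattern. A rough sketch of that kind of watchdog (the hostnames, port, and interval are placeholders, not our real values):

    import socket
    import time
    from datetime import datetime

    # Placeholder management addresses of the ESXi blades -- substitute your own.
    HOSTS = ["esx-blade-01.example.com", "esx-blade-02.example.com"]
    CHECK_INTERVAL = 60   # seconds between sweeps
    MGMT_PORT = 443       # vSphere management port; any port the host always answers on works

    def host_alive(host, port=MGMT_PORT, timeout=5):
        """Return True if a TCP connection to the host's management port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    down_since = {}       # host -> first time it failed a check

    while True:
        now = datetime.now()
        for host in HOSTS:
            if host_alive(host):
                down_since.pop(host, None)
            elif host not in down_since:
                down_since[host] = now
                print(f"{now:%Y-%m-%d %H:%M:%S}  {host} stopped responding")
        time.sleep(CHECK_INTERVAL)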

Has anyone seen a similar problem? If so, what is your config?

9 Replies
tcornwell
Contributor

Correction: 96GB of RAM

HughBorg707
Hot Shot

I've seen this same issue brought up in the forums before. Turns out it was iSCSI SAN related.

Here is a related link: http://communities.vmware.com/thread/213710

The other articles I read mentioned timeouts, rebooting the SAN without taking VMs offline, and other issues.

Regards

tcornwell
Contributor

That link looks similar, but not the same. The post you refer to says that the server is temporarily hanging; in our case, the server locks up completely and never frees up again. Also, that post describes indicators prior to the lock-up and in the logs to help guide troubleshooting, and we have none of that. As we are running other blades in the exact same storage configuration without the problem, I find it highly unlikely that iSCSI is related; however, we have ruled nothing out at this time.

zhangtong3910
VMware Employee

Could you collect vm-support logs using the vSphere Client after rebooting the locked-up server?

It's a serious problem, and you can submit your logs to VMware.
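If the vSphere Client can't pull the bundle from the host, it can also be generated on the host itself. A rough sketch using Python and the paramiko SSH library (this assumes SSH / Tech Support Mode access to the host is enabled; the hostname, credentials, and datastore path are placeholders, and the vm-support output-directory flag can vary between ESX versions):

    import paramiko

    HOST = "esx-blade-01.example.com"   # placeholder ESX/ESXi host
    USER = "root"
    PASSWORD = "***"                    # or set up key-based auth instead

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(HOST, username=USER, password=PASSWORD)

    # vm-support writes a compressed diagnostic bundle (logs, config, vmkernel
    # state) that can then be attached to the support request with VMware.
    stdin, stdout, stderr = client.exec_command(
        "vm-support -w /vmfs/volumes/datastore1"
    )
    print(stdout.read().decode())
    print(stderr.read().decode())

    client.close()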

walidtmi
Contributor

I have exactly the same symptoms. Our config is 4 Blade 6000 chassis with 6 to 7 X6450 each, and 2 ST6140 arrays with 2 to 3 JBODs each. We use 2 NEMs (X4250A) on each Blade 6000 and a dual-port FC card for each X6450.

The only point my config has in common with yours is the X6450 module.

We upgraded all modules to vSphere U1, but the lockup persists.

Have you found a solution to this lockup?

sto6ma9ch
Contributor

We are experiencing this issue as well with the Sun x6450 blades. Here's our configuration:

  • 2 Sun Blade 6000 Chassis (1 per datacenter)

  • 10 Sun x6450 Blade Modules (6 in one, 4 in the other)

  • 1 4-Port Gigabit Ethernet PCIe card per blade

  • 1 2-Port Gigabit Ethernet + 2-Port Fibre Channel PCIe card per blade

  • VMware ESX 3.5 Update 4, ESX 3.5 Update 5, ESX 4.0, ESX 4.0 Update 1

Description of the issue:

  • ESX server is marked as Not Responding in vCenter

  • Console of the server is unresponsive; no errors logged to the console

  • Network connectivity to the server stops

  • All virtual machines running on the server are restarted on other hosts (HA host failure response)

  • Hard reset allows the server to rejoin the cluster

  • No errors for the lock-up are logged in ESX

  • No errors are logged in the blade iLOM

  • No errors are logged in the blade chassis iLOM

  • Occurs on all 10 blades in both chassis at each datacenter

  • Have not been able to reliably reproduce the issue

  • Frequency can be from one day to two months; seemingly random

We have someone at Sun support who may have found a bug with the x6450 and ESX. This has not been confirmed yet.

jstevensTMRK
Contributor

I'm having a very similar problem. Do you have any information about the (potential) bug?

sto6ma9ch
Contributor

Sun support has not provided any more information about the potential bug.

We just noticed that the x6450 blade modules are no longer sold on Oracle's website, which makes resolving this issue moot for us. Even if the issue were resolved, we'd have 10 blades across two chassis that we couldn't expand. Moving all ten x6450 blades to one chassis and putting other blades with potentially different processors in the other chassis would cause issues with Site Recovery Manager. It makes more sense to replace these blades with ones that work than to keep waiting for a blade to lock up, generating diagnostic logs, forwarding them to support, and getting the call that the logs don't include any clues as to the cause of the lock-up.

We've focused on working with our Sun sales rep to replace the x6450s with their AMD counterpart, the x6440s. From what I can see, the x6440 is the only 4-socket, six-core blade that's both available for sale and supported by VMware. Of course, the x6450 is supported by VMware, too. We'll see how that goes.

sto6ma9ch
Contributor

More info on this:

We were able to see that this issue does generate a purple screen error, but only after changing the console's TTY to the vmkernel's diagnostic logging screen (Alt+F12) (see attached screenshot):

CpuSched: VcpuWaitForSwitch: timed out
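
By the time that message appears the console is already frozen, so the only way to keep the vmkernel messages is to forward syslog to a remote target (on classic ESX that's an entry in the service console's /etc/syslog.conf; on ESXi, as far as I recall, the Syslog.Remote.Hostname advanced setting). If no syslog server is handy, a bare-bones UDP listener is enough to capture the stream; a rough sketch, with the port and output path as placeholders:

    import socket
    from datetime import datetime

    # Minimal remote syslog sink: ESX hosts forward log messages here over
    # UDP 514 (the standard syslog port), and every line is written out with
    # a receive timestamp so messages sent just before a lock-up aren't lost.
    LISTEN_ADDR = ("0.0.0.0", 514)   # binding a port below 1024 needs root
    LOG_FILE = "esx-vmkernel.log"    # placeholder output path

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN_ADDR)

    with open(LOG_FILE, "a") as out:
        while True:
            data, (src_ip, _port) = sock.recvfrom(4096)
            line = data.decode("utf-8", errors="replace").rstrip()
            out.write(f"{datetime.now():%Y-%m-%d %H:%M:%S} {src_ip} {line}\n")
            out.flush()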

I also found this thread on Oracle's forum website:

Looks like it's definitely an issue with the hardware or firmware, since the same lockup occurs on Red Hat and Fedora servers. I guess since ESX is based on Red Hat, there could be a connection.
