We are running VMware vSphere 4 ESXi on Sun X6450 blades with the six-core Intel Dunnington processors, booting from CompactFlash, with iSCSI storage, 10GbE, and 92GB of RAM. The blades run fine for 10 to 30 days before completely locking up. There is no PSOD and no indication in any log that something has gone wrong; the machine simply stops responding. The issue is intermittent and apparently random, and we can't find any correlation to the problem except the hardware. We are also running X6250 blades in a similar configuration with no problem. We see the problem under high load, moderate load, and no load; we have even seen it on blades that ESXi had in maintenance mode.
Has anyone seen a similar problem? If so, what is your config?
I've seen this same issue brought up in the forums before. Turns out it was iSCSI SAN related.
Here is a related link: http://communities.vmware.com/thread/213710
The other articles I read mentioned timeouts, rebooting the SAN without taking VMs offline, and other issues.
That link looks similar, but not the same. The post you refer to says the server hangs temporarily; in our case, the server locks up completely and never frees up again. The post also mentions indicators prior to the lockup, and entries in the logs, that help guide troubleshooting. We have none of that. Since we are running other blades in the exact same storage configuration without the problem, I find it highly unlikely that iSCSI is related; however, we have ruled nothing out at this time.
I have exactly the same symptoms. Our config is 4 Blade 6000 chassis with 6 to 7 X6450s each, and 2 ST6140s with 2 to 3 JBODs each. We use 2 NEMs (X4250A) on each Blade 6000 and a dual-port FC card for each X6450.
The only common point with my config is the X6450 module.
We upgraded all modules to vSphere U1, but the lockup persists.
Have you found a solution to this lockup?
We are experiencing this issue as well with the Sun x6450 blades. Here's our configuration:
2 Sun Blade 6000 chassis (1 per datacenter)
10 Sun x6450 Blade Modules (6 in one, 4 in the other)
1 4-Port Gigabit Ethernet PCIe card per blade
1 2-Port Gigabit Ethernet + 2-Port Fibre Channel PCIe card per blade
VMware ESX 3.5 Update 4, ESX 3.5 Update 5, ESX 4.0, ESX 4.0 Update 1
Description of the issue:
ESX server is marked as Not Responding in vCenter
Console of the server is unresponsive; no errors logged to the console
Network connectivity to the server stops
All virtual machines running on the server are restarted on other hosts (HA host failure response)
Hard reset allows the server to rejoin the cluster
No errors for the lock-up are logged in ESX
No errors are logged in the blade iLOM
No errors are logged in the blade chassis iLOM
Occurs on all 10 blades in both chassis at each datacenter
Have not been able to reliably reproduce the issue
Frequency can be from one day to two months; seemingly random
We have someone at Sun support who may have found a bug with the x6450 and ESX. This has not been confirmed yet.
Sun support has not provided any more information about the potential bug.
We just noticed that the x6450 blade modules are no longer sold on Oracle's website, which makes resolving this issue moot for us. Even if the issue were resolved, we'd have 10 blades across two chassis that we couldn't expand. Moving all ten x6450 blades into one chassis and putting other blades, with potentially different processors, in the other chassis would cause issues with Site Recovery Manager, since mixing CPU types breaks vMotion compatibility. It makes more sense to replace these blades with ones that work than to keep waiting until a blade locks up, generating diagnostic logs, forwarding them to support, and getting the call that the logs don't include any clues as to the cause of the lockup.
We've focused on working with our Sun sales rep to replace the x6450s with their AMD counterpart, the x6440s. From what I can see, the x6440 is the only 4-socket, six-core blade that's both available for sale and supported by VMware (the x6450 is supported by VMware too, of course). We'll see how that goes.
More info on this:
We were able to see that this issue does generate a purple screen error, but only after changing the console's TTY to the vmkernel's diagnostic logging screen (Alt+F12) (see attached screenshot):
CpuSched: VcpuWaitForSwitch: timed out
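For anyone else chasing this: the message only flashes on the Alt+F12 screen, but vmkernel messages are also persisted to disk (on ESX classic the default is /var/log/vmkernel; ESXi logs to /var/log/messages), so after a hard reset a quick grep of those files is worth trying before opening a support case. A minimal sketch, with the log line simulated here so the pattern is easy to verify:

```shell
# Simulated vmkernel log line matching what the Alt+F12 screen showed.
# Against a real host, point the same grep at /var/log/vmkernel* (ESX
# classic) or /var/log/messages* (ESXi) after the hard reset.
printf 'CpuSched: VcpuWaitForSwitch: timed out\n' |
  grep -c 'VcpuWaitForSwitch'
# prints 1
```

No promises this shows up in your logs, of course; in our case nothing was flushed to disk before the hang, which is why the Alt+F12 screen was the only place we ever saw it.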
Looks like it's definitely an issue with the hardware or firmware, since the same lockups occur on Red Hat and Fedora servers. I guess there could also be a connection there, since the ESX service console is Red Hat-based.