VMware Cloud Community
dTardis
Contributor

ESX 4 host becomes unresponsive

We have had this problem twice now, each time on a different host. We have three hosts, all Dells: two R900s and a newer R910. All are currently running 4.0 U2.

The systems have been in place and running without problems for more than a year. They are connected to an EMC CX3/20 and to an HP LeftHand 4300G2, which was added in the last 6 months. The EMC is connected with fiber, and the HP with a dedicated, isolated network.

The problem has only happened with the R900s so far, the first time about 3 months ago. The server was completely unresponsive. vCenter said it was offline, but did not start HA for some reason. I could ping the host, but the vSphere client could not log in to it, SSH didn't work, and you could not log in from the physical console. We were able to remote into the VMs and shut them down. Then we had to hard reboot the physical host. Once it came back up, tech support started investigating what had happened. What they found was that the server had stopped logging anything about a week prior. They had no explanation as to what caused this. We waited a few days and they looked at the logs again and found no errors or problems.
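
For anyone wanting to check their own hosts for a silent stop in logging like the one described above, here is a minimal sketch in Python. It assumes the log uses syslog-style lines that begin with a "Mon DD HH:MM:SS" timestamp; that format, the function name find_log_gaps, and the two-hour threshold are assumptions for illustration, not something support gave us.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year, min_gap=timedelta(hours=2)):
    """Return (previous, current) timestamp pairs that are more than min_gap apart.

    Assumes each log line starts with a syslog-style "Mon DD HH:MM:SS" stamp.
    Syslog stamps carry no year, so the caller supplies one.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            # The first 15 characters hold the timestamp, e.g. "Jan  5 10:00:00".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous > min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

It could be run against a copy of a log pulled from the host, for example find_log_gaps(open("vmksummary"), year=2010). Each pair in the result marks the last entry before a silence and the first entry after it. Logs that span a year boundary would need extra handling.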

Today we had much the same problem, this time on the other R900 server. There were a couple of differences. I could not remote into the VMs to shut them down; they were unresponsive. This time logging stopped at about the time the incident started. The tech on the phone suggested that it could be a problem with the storage, specifically the HP storage. Now I understand that the host systems do get very, very angry if they can't talk to the storage, but this was not affecting the other hosts that are also connected to the SAN. I then talked to the LeftHand group at HP and asked them to collect the logs and look through them. They did, and they found one thing: a set of errors that the tech believed said basically this: the SAN did what the host (the name of the unresponsive host) asked it to do, but when the SAN told the host that the task was completed, the host didn't acknowledge it.

Other than that they found no problems. I asked them to look at the other metrics and network settings and they found nothing that was a problem.

I am going to send logs to VMware to look at, but I'm not hopeful that a cause will be found. Has anyone else seen this? I really don't want this to happen anymore. Right now the only thing that I can think of is to reboot my host systems once a month.

4 Replies
idle-jam
Immortal

The possibilities are endless. Sending the logs to VMware in a support request is the best thing to do. Good luck.

beyondvm
Hot Shot

Can you post the file /var/log/vmksummary from the unresponsive host?

dTardis
Contributor

What is the best way for me to do that?

dTardis
Contributor

After working with the VMware Support team for the last week, it is clear that no one knows exactly what the problem is. They are still going over the logs looking for a cause, but at this point I feel that it is hardware related.

So I am closing this post out.
