VMware Cloud Community
Lightbulb
Virtuoso

RE ESX 3.0.1 Host access issue.

This is not a troubleshooting question per se, but more of a "Has anyone heard of this issue before?" Where I work, another group has a ticket open with VMware regarding an ESX 3.0.1 system (I don't know the exact build).

The system is an HP DL585 G1. Once in a while the system shows as disconnected in VC. When in this state, the system cannot be accessed via SSH, via the standalone VIC, or from the console. The VMs (which reside on SAN LUNs) keep running and show no signs of issues. To resolve the issue the system has to be power cycled, which the client does not like.

There are no obvious indications of errors in /var/log/messages, /var/log/vmkernel, etc. The HP agents are a little out of date (they are at 7.8, where the latest is 8.11), but there are no weird messages in /var/spool/compaq/cma.log and no HP alerts that sound like this issue in the revision history for the agents.

The VMware tech says that, from the look of the log data, the SC has gone into a read-only state (I get this second hand), as the logs just stop when the host becomes inaccessible.

I will take a closer look at the local storage on the box tomorrow. It just seems a little odd that a host can become this degraded without any impact on the VMs (I guess that is a good thing).

If anybody has any experience of a similar problem, I would be happy to hear about it. Hopefully VMware support will have this all sorted out sometime soon.

Thanks.

Rob_Bohmann1
Expert

I ran into a similar situation on some IBMs (460s, with the lunch pail for processor boards) a few years ago after a DR test.

In this case, I got in front of the server and was able to see the console but was unable to log in or do anything to interact with the console OS. "The console OS was read-only" would have been a good description. The console was scrolling messages; I cannot remember what they said.

I ended up having to poke it in the eye after remoting in and shutting down all the guests. We were running 3.0.1, and I would gamble that it would have been a build number in the low to mid 30,000s (33xxx-35xxx), as this would have been in the timeframe of either January or April/May of 2007.

I do not remember if this happened again over the next few months. These hosts were patched regularly. So to answer your question: yes, I have seen it. I do not remember the cure; I would guess patches.

How are those 585 G1s for ASRs? Just wondering, as I had a less than stellar experience with some once upon a time.

kjb007
Immortal

This has happened to me a few times, on ESX as well as on regular RHEL 4/5 physical servers. Since the service console is Linux based, it behaves similarly to the RHEL 4/5 physical machines I have. If the boot LUN loses connection to the / filesystem for more than 45 seconds, it will protect itself by putting / into read-only mode. (My servers boot from SAN, but this can happen with local storage as well, although it's less likely.) Everything already running in memory, and anything on the separate filesystems, keeps working, which is why the host continues to ping and the VMs continue to run. But you can't log in. You see a prompt, because the pseudo terminal is running in memory, so it will prompt you for a login; but after you have authenticated successfully, that event has to be written to the log files, which are now on a read-only filesystem, so processing stops there. You're basically stuck.
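For anyone who wants to catch this state before the host becomes unreachable, here is a minimal Python sketch that checks whether / shows up as read-only in a Linux mount table such as /proc/mounts. The function names are my own, and it assumes the usual whitespace-separated layout (device, mount point, filesystem type, options). I have not run this on an ESX 3.0.1 service console, so treat it as an illustration rather than a tested tool.

```python
def mount_options(mounts_text, mount_point="/"):
    """Return the option list of the last mount table entry for mount_point, or None."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            # A later entry for the same mount point overrides an earlier one
            options = fields[3].split(",")
    return options


def is_read_only(mounts_text, mount_point="/"):
    """True if mount_point is listed and its options include 'ro'."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options


def local_root_is_read_only(mounts_path="/proc/mounts"):
    """Check the running system; mounts_path is an assumption about the host."""
    with open(mounts_path) as handle:
        return is_read_only(handle.read())
```

A check like this only helps if it runs from a process that is already in memory and reports somewhere off the box, since nothing new can be written to the local logs once / is read-only.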

I've been able to save some downtime by disconnecting the server from VC, browsing the datastore and re-registering the VMs, and then going into the VMs and shutting them down, and immediately restarting them from VC. That way, the VMs require a shutdown, but at least it's not a crash. This does assume you have a cluster, though.

Very frustrating, but ultimately it's an I/O issue to your storage. You can increase the disk timeout to make this scenario less likely, but it's a balancing act.
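As a rough way to see where the disk timeouts currently stand, the Python sketch below reads the per-device SCSI command timeout that Linux 2.6 kernels (such as RHEL 4/5) expose under /sys/block/<device>/device/timeout. Whether the ESX 3.0.1 service console exposes the same files is an assumption I have not verified, and the helper names are hypothetical. The 45-second figure in the usage note is simply the one mentioned above.

```python
import glob
import os


def read_scsi_timeouts(sysfs_block_dir="/sys/block"):
    """Map block device name -> SCSI command timeout in seconds, where exposed."""
    timeouts = {}
    pattern = os.path.join(sysfs_block_dir, "*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        # Path layout: <sysfs_block_dir>/<device>/device/timeout
        device = path.split(os.sep)[-3]
        try:
            with open(path) as handle:
                timeouts[device] = int(handle.read().strip())
        except (OSError, ValueError):
            # Skip devices whose timeout cannot be read or parsed
            continue
    return timeouts


def devices_below(timeouts, minimum_seconds):
    """Return the sorted names of devices whose timeout is under minimum_seconds."""
    return sorted(name for name, seconds in timeouts.items() if seconds < minimum_seconds)
```

For example, `devices_below(read_scsi_timeouts(), 45)` would list the devices that give up sooner than 45 seconds. Raising a value means writing to the same file as root, and the change does not survive a reboot unless it is reapplied at startup.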

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
Lightbulb
Virtuoso

The host is one of three old non-clustered systems that belong to this client. I will post the results of the VMware investigation to this thread when I hear of them. I don't have time to look at the local storage, as I have to go back to my DBA training. Since I am not a DBA, nor inclined to be one, this has been a week of chronic MEGO for me :)

Thanks for the info.

kjb007
Immortal

The problem that you've already seen is that it is very difficult to troubleshoot this issue from the ESX side alone. Since there are no logs that can be looked at after the fact, because of the read-only issue, it's difficult for VMware support to collect those logs and find much useful information in them. Hopefully that is not the case, and you can see some events that lead up to the eventual state the server went into. This would be a good time to suggest the VIMA appliance, so that syslog has an alternate server to log errors to. Depending on the sequence of writes, this may or may not have more information, but I haven't run into this scenario since I set up my logging server, so I can't validate whether you would or would not see alerts during that condition.
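To illustrate the idea of an alternate log destination, here is a minimal Python sketch that sends a message to a remote syslog collector over UDP using the standard library's `SysLogHandler`. This is not the VIMA configuration itself; the logger name, the function name, and the collector address in the usage note are placeholders I made up.

```python
import logging
import logging.handlers


def build_remote_logger(host, port=514, name="esx-host-health"):
    """Return a logger that forwards records to a syslog collector at host:port."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler uses UDP by default when given a (host, port) tuple
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Something like `build_remote_logger("loghost.example.com").warning("root filesystem is read-only")` would then land on the collector even when the local log files can no longer be written. UDP delivery is best effort, so a message sent during a storage or network hiccup may still be lost.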

Good luck. Hopefully you'll have some good data.

-KjB

Lightbulb
Virtuoso

My colleagues rebuilt the hosts to 3.5. So all's well that ends well.
