Hi Guys,
Got a small twin-host setup here running ESX 3i 3.5 servers connected over iSCSI to an MSA disk shelf. Both hosts are clustered with HA & DRS enabled, and all was well for a number of months after installation... until Sunday night. Our UPS decided to have a hernia and stopped passing power to our server/comms cabinets, resulting in the inevitable instant power-off of all hardware. Great way to start the year off!
Anyway, 99% of all systems came back up without a worry once we bypassed the dodgy UPS (waiting for that to be fixed), but ever since we have had an "HA Agent has an error" message with the red alert symbol on one host only. The second host has no problems. I have tried a number of things to repair the HA error from the VI console, including:
a) Right-clicking the dodgy host and selecting "Reconfigure for VMware HA" <-- This results in a ": cmd remove failed:" error in the events list for the host just after the message "enabling HA agent"
b) Removing the dodgy host from the cluster and then re-joining it
c) Dropping the dodgy host into maintenance mode and then taking it back out
d) Restarting the dodgy host
OK, now my Linux skills can easily be classed as newbie, but I knew enough to get the following console info for troubleshooting:
1) I logged into the VMware OS and ran a Test Management Network. Gateway, 2x DNS servers and hostname resolution were all successful
2) Based on some other threads I have seen here, I then logged into the console and did a test ping by hostname to the other host and the VirtualCenter server. Both were successful, so I doubt I have a DNS or hostname problem
3) Did an Alt-F12 to view the logging screen and noticed something that may be related. Every 10 minutes or so I get the following entries:
WARNING: UserThread: 406: peer table full for sfcbd
WARNING: World: vm 204457: 910: init fn user failed with: Out of resources!
WARNING: World: vm 20457: 1775: WorldInit failed: trying to cleanup.
That may or may not be related, but I have a gut feeling it is.
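In case it helps anyone reading along, here is a rough Python sketch for tallying those warnings once the log text has been copied off the host to a workstation. The regex is only guessed from the three lines quoted above, and the field names (component, world, number) are my own labels rather than anything VMware documents, so treat it as a starting point only.

```python
import re
from collections import Counter

# Pattern inferred from the three WARNING lines in this post; other
# vmkernel messages may well not match it. Field names are my own labels.
WARNING_RE = re.compile(
    r"^WARNING: (?P<component>\w+): "
    r"(?:vm (?P<world>\d+): )?"      # optional "vm <id>:" part
    r"(?P<number>\d+): "             # numeric field; its meaning is a guess
    r"(?P<message>.*)$"
)


def parse_warning(line):
    """Return a dict of fields for one WARNING line, or None if it does not match."""
    match = WARNING_RE.match(line.strip())
    return match.groupdict() if match else None


def count_by_message(lines):
    """Count how often each (component, message) pair appears."""
    counts = Counter()
    for line in lines:
        fields = parse_warning(line)
        if fields:
            counts[(fields["component"], fields["message"])] += 1
    return counts


# The three lines exactly as I copied them from the Alt-F12 screen.
SAMPLE = [
    "WARNING: UserThread: 406: peer table full for sfcbd",
    "WARNING: World: vm 204457: 910: init fn user failed with: Out of resources!",
    "WARNING: World: vm 20457: 1775: WorldInit failed: trying to cleanup.",
]

if __name__ == "__main__":
    for (component, message), n in count_by_message(SAMPLE).items():
        print(n, component, message)
```

Feeding it a longer capture instead of SAMPLE should show whether the sfcbd "peer table full" message really does recur on a steady interval alongside the "Out of resources" ones.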
Any ideas what I need to check next in order to get HA functioning again on this host?
Thx
Mitch