jhdore
Contributor

ESX 3.0.2 "disconnected" after upgrade to VC 2.5

Hi all,

I've partially upgraded my two-node VI cluster: one node has been upgraded to ESX 3.5, and my VirtualCenter server has been upgraded to 2.5. The database also upgraded successfully, after I ran the upgrade manually and tweaked the permissions on MSDB so that my vclogin user had DBO rights.

However, when I started the 2.5 VI Client, it attempted to upgrade the agent on the other node (ESX 3.0.2) and hung in the process: it was stuck at 80% for 40 or so minutes (a coffee-break, walk-around-the-garden, read-the-paper-length interval) :)

During this time, the VMs it was hosting stopped responding as well (I had VMotioned them to it so the other node could be upgraded). So I rebooted it.

After the reboot, and kicking the SAN, the server came up and I could connect to it directly with the VI Client, and the VMs it is hosting work normally, including the VCMS management server! I could also connect the VI Client to the VCMS, but the VCMS says the node is disconnected. I have deleted the node from the VCMS and attempted to reconnect it. The reconnection wizard recognises the hardware and the VMs running on it, but it cannot connect, failing with the following error:

Unable to access the specified host. It either does not exist, the server software is not responding, or there is a network problem.

I'm guessing it's the middle option, since:

- I can ping the node's service console and vmkernel IP addresses from itself, the VCMS, and another machine

- I can connect to it directly with the VI Client

- The VCMS hosted on it talks to the other (3.5.0) node ok

- When I start the VI Client to connect to the 3.0.2 node, it calls up version 2.0 whereas connecting to the 3.5 node calls up version 2.5...

How do I upgrade the agent manually?

Cheers,

James

23 Replies
rvandalen
Contributor

Hi James,

I had exactly the same issue after upgrade.

In my case an old fix from the VC 2.0 days worked:

Create the folder /tmp/vmware-root on your ESX host and try to reconnect it to VC (it could take a while to upgrade the agent).

See also: http://communities.vmware.com/click.jspa?searchID=1042809&objectType=2&objectID=619878
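
For anyone following along, the directory step might look like the sketch below. The helper name ensure_dir is made up for illustration; only the /tmp/vmware-root path comes from the tip above.

```shell
#!/bin/sh
# Hypothetical helper (not a VMware tool): create a directory only if it
# does not already exist, and report which of the two cases applied.
ensure_dir() {
    dir="$1"
    if [ -d "$dir" ]; then
        echo "exists: $dir"
    else
        mkdir -p "$dir" && echo "created: $dir"
    fi
}

# On the ESX service console, as root, the call would be:
#   ensure_dir /tmp/vmware-root
```

If it prints "exists", the missing directory is probably not what is blocking the agent upgrade.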

Regards,

Rene

jhdore
Contributor

Ah, yes - I saw tips to that effect, but when I checked, that directory already existed:

# ls -l
total 14412
drwxr-xr-x 2 root root 4096 Jan 10 15:30 aam
-rw------- 1 root root 2 Jan 11 12:08 cimserver_start.conf
srwxrwxrwx 1 root root 0 Jan 11 12:08 cimxml.socket
-rw-r--r-- 1 root root 14706068 Jan 11 12:15 esx-2008-01-11--12.14.1277.tgz
drwxr-xr-x 2 root root 4096 Jan 11 12:07 hsperfdata_root
drwx------ 2 root root 4096 Jan 11 15:22 vmhsdaemon-0
-rw-r--r-- 1 root root 862 Jan 11 12:07 vmkdump.log
drwx---r-x 2 root root 4096 Jan 14 12:15 vmware-hostd-ticket
drwx------ 2 root root 4096 Jan 11 15:22 vmware-root

Are the permissions consistent with yours?

rvandalen
Contributor

OK, if you looked at that, I guess you have already tried restarting the mgmt-vmware service on the console and then reconnecting?

If not, run the following:

service mgmt-vmware restart

If that also fails, you can always try to install the agent manually.

The following document describes the procedure; you only need to choose the right version for VC 2.5:

http://communities.vmware.com/docs/DOC-2192

jhdore
Contributor

Aye, I tried the service mgmt-vmware restart and service vmware-vpxa restart commands. They restarted OK, but didn't change the result. That DOC link looks like it might be on the money though. Back in x mins...

Cheers,

James

RParker
Immortal

I have the same problem on an ESX 2.5 server; it was fine up until the 2.5.5 release. I don't know if it's related, but I tried the same thing, and it was no go either.

ALL my ESX 3.5/3.0 hosts are fine.

jhdore
Contributor

Hmm. It doesn't specifically state which file to use for VC 2.5. I have a guess, but it is just that:

vpx-upgrade-esx-6-linux-64192

- I'm basing this guess on VMware keeping the same meaning for X in vpx-upgrade-esx-X-linux-YYYYY and making YYYYY the particular build number, IYSWIM.

Cheers,

James

rvandalen
Contributor

You can verify the version needed in the bundleversion.xml file (which is in the upgrade folder).

How do I manually install the VC management agent?

If after upgrading VirtualCenter you find some of your ESX hosts disconnected you can manually upgrade the management agent on the ESX server by following the below steps:

  • First log into the ESX server console and check the agent version on the servers that are disconnected by typing "vpxa -v". The version needs to match the version of VirtualCenter being used.

- VC 2.0.1 build number is 32042

- VC 2.0.1 Patch 1 build number is 33643

- VC 2.0.1 Patch 2 build number is 40644

- VC 2.5 build number is 64192

  • Open the upgrade folder of the VC 2.5 installation. By default this will be "C:\Program Files\VMware\Infrastructure\VirtualCenter Server\upgrade".

  • You need to use the correct file for your version of ESX Server. You can find the answer in bundleversion.xml:

- 2.0.1+ = vpx-upgrade-esx-0-linux-*

- 2.1.0+ = vpx-upgrade-esx-1-linux-*

- 2.5.0 = vpx-upgrade-esx-2-linux-*

- 2.5.1 = vpx-upgrade-esx-3-linux-*

- 2.5.2 = vpx-upgrade-esx-4-linux-*

- 2.5.3+ = vpx-upgrade-esx-5-linux-*

- 3.0.0+ = vpx-upgrade-esx-6-linux-*

- 3.5.0+ = vpx-upgrade-esx-7-linux-*

- e.x.p = vpx-upgrade-esx-7-linux-*

  • Copy the file "vpx-upgrade-esx-y-linux-xxxxx" to your ESX host, where y and xxxxx are based on bundleversion.xml. xxxxx is the build number, e.g. vpx-upgrade-esx-6-linux-40644. Use a secure copy utility such as WinSCP or PuTTY PSFTP to copy this file to the ESX server.

  • Log in to the ESX server as root.

  • In the directory where you copied the upgrade bundle, run the command: sh ./vpx-upgrade-esx-y-linux-xxxxx (xxxxx is the build number)

  • Once the install is complete run the command "service vmware-vpxa restart" followed by "service mgmt-vmware restart"

  • Check the version again by typing "vpxa -v", the version should now be the new version. Now open your VI Client, try to connect to the ESX host.
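
As a rough sketch of the file-selection step, the mapping in the list above can be written as a small shell function. The names bundle_index and bundle_name are invented here, the patterns are one reading of that list (the "e.x.p" entry is left out), and bundleversion.xml remains the authority if the two disagree.

```shell
#!/bin/sh
# Hypothetical helpers: map an ESX version string to the "y" digit in
# vpx-upgrade-esx-y-linux-xxxxx, following the list quoted above.
bundle_index() {
    case "$1" in
        3.5.*)       echo 7 ;;
        3.[0-4].*)   echo 6 ;;
        2.5.[3-9]*)  echo 5 ;;
        2.5.2*)      echo 4 ;;
        2.5.1*)      echo 3 ;;
        2.5.0*)      echo 2 ;;
        2.[1-4].*)   echo 1 ;;
        2.0.[1-9]*)  echo 0 ;;
        *) echo "no bundle known for ESX $1" >&2; return 1 ;;
    esac
}

# Build the full file name from the ESX version and the VC build number.
bundle_name() {
    idx=$(bundle_index "$1") || return 1
    echo "vpx-upgrade-esx-${idx}-linux-$2"
}

# Example for an ESX 3.0.2 host and VC 2.5 (build 64192), run as root in
# the directory the bundle was copied to (commented out on purpose):
#   sh "./$(bundle_name 3.0.2 64192)"
#   service vmware-vpxa restart
#   service mgmt-vmware restart
```

For ESX 3.0.2 and build 64192 this yields vpx-upgrade-esx-6-linux-64192.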

jhdore
Contributor

OK, vpxa -v yields nowt 😆

# vpxa -v
-bash: vpxa: command not found
# services vpxa -v
-bash: services: command not found
# service vpxa -v
vpxa: unrecognized service
# service vmware-vpxa -v
Usage: vmware-vpxa {start|stop|status|restart}
# locate vpxa
-bash: locate: command not found
# whereis vpxa
vpxa:
#

rpm -qa lists

VMware-vpxa-2.5.0-64192

as the last item, however, so it would seem it's at least partially installed. (rpm --rebuilddb gives no errors.)

Based on the .xml file, I chose vpx-upgrade-esx-6-linux-64192 - as I'm upgrading 3.0.2 to build level 64192.

It ran, but vpxa still doesn't exist, and the VC Client still can't connect.

# sh ./vpx-upgrade-esx-6-linux-64192
# service vmware-vpxa restart
Stopping vmware-vpxa:
Starting vmware-vpxa:
# service mgmt-vmware restart
Stopping VMware ESX Server Management services:
VMware ESX Server Host Agent Watchdog
VMware ESX Server Host Agent
Starting VMware ESX Server Management services:
VMware ESX Server Host Agent (background)
Availability report startup (background)
# vpxa -v
-bash: vpxa: command not found

Might just down the node, and upgrade it anyway. I can take the outage if I do it at a lunchtime.

Cheers,

James

Byron_Zhao
Enthusiast

I had the same problems too. Well, actually it was worse than that. One host's service console lost its network connection, while another host kept reporting an HA agent error. For the first host, after I stopped the firewall and restarted hostd, it picked up its network connection again. For the second host, I disabled HA for the whole cluster, paused, and enabled HA again. That fixed the HA problem for me.

sandu
Enthusiast

James,

Remove the vpxa rpm manually (from the host that you are not able to reconnect) and try to connect to the machine again from VC. This might fix your problem.

-Sandu

jhdore
Contributor

Downed node and upgraded from CD. Now working ok.

RUG201110141
Enthusiast

Had the same issue. What I did was similar to Byron_Zhao: I disabled the firewall (incoming and outgoing) and restarted several things, including hostd, and the service console network connectivity immediately came back. I then re-enabled the firewall on the machine and everything was fine.

I also had HA failing on a bunch of machines. Probably about 30% of my ESX servers had HA issues when going to VirtualCenter 2.5. I opened a case with VMware. They perused the ESX servers, checked versions, checked logs, and acted like I was making the problem up. After several hours the problem magically went away.

I wonder if it's because of the machines that had lost service console connectivity. I don't know how the HA functionality interrelates with machines in a cluster, but if the primary machine in the cluster is offline for whatever reason, would HA have an issue on other machines?

Byron_Zhao
Enthusiast

I submitted my tech support request via the web and accidentally opened two cases. Both support engineers checked the logs and couldn't find any software error. They both said it must be a hardware problem: either a bad cable, a bad switch port, or some other external error. I didn't change anything, but everything has been working fine since then. I think disabling HA before upgrading to VirtualCenter 2.5 might be a way to get around this problem. I did disable HA in the previous upgrade, but didn't realize the same applies to 2.5.

-BZ

RUG201110141
Enthusiast

Actually, now that you mention it, the tech support person from VMware said that disabling HA prior to the VirtualCenter 2.5 upgrade was recommended. I pointed out that it doesn't say that anywhere in the documentation, and they were going to review that. That was the last I heard.

jhdore
Contributor

Huh, this is beginning to make more and more sense. Since I got the problem node (node0) working, the other node (node1) which had the upgrade is reporting HA error: "Internal AAM error. Agent did not start".

Are there docs on how to install the AAM agent manually?

Cheers,

James

Byron_Zhao
Enthusiast

James,

No need to reinstall AAM. I had the same issue, and worked with tech support for two hours to manually add nodes to the cluster. However, HA kept going on and off. Tech support finally gave up and said they would get back to me. I turned off HA before I went home, and two hours later, after I got home and finished dinner, I re-enabled it and it started working fine.

Hope this helps.

-BZ

azn2kew
Champion

It depends on which VMware Support engineer you get; if you have Platinum Support, you get your money's worth. I've had a really good experience with one VMware Support engineer, and I always ask to be assigned to him because he goes straight to the problem and troubleshoots with sharp analytical skills. 99% of the time he solves the problem without needing more time to look into it. That matters especially with the large ESX cluster farms we're hosting, with a bunch of PE 6950 hosts running... you get the idea.

We've experienced the exact same issue. The way to fix it is basically to remove the VPX agent and have it reinstalled by adding the host back to the cluster.

1. Use "rpm -qa | grep -i vpxa" to find your current vpx agent version. (Simply running "vpxa -v" is quicker.)

2. Use "rpm -e vpxaversion" (where vpxaversion is the package name found in step 1) to remove the vpx agent, and then re-add your host to the cluster.
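
The two steps above can be sketched as a short script. This is only an illustration: find_vpxa_pkg is a made-up helper name, and it assumes the package name starts with "VMware-vpxa", as in the rpm output posted earlier in this thread. It reads the package list on stdin so it can be tried without touching a real host.

```shell
#!/bin/sh
# Hypothetical helper: print the first package whose name starts with
# "VMware-vpxa" (case-insensitive) from a list on stdin. Exits non-zero
# if no such package is listed, so callers can stop before running rpm -e.
find_vpxa_pkg() {
    awk 'tolower($0) ~ /^vmware-vpxa/ { print; found = 1; exit }
         END { exit !found }'
}

# On the ESX service console, as root (commented out on purpose):
#   pkg=$(rpm -qa | find_vpxa_pkg) && rpm -e "$pkg"
# Then re-add the host to the cluster from VirtualCenter.
```

Checking the exit status first avoids calling rpm -e with an empty package name when the agent is not installed at all.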

Sometimes it also works well to install the correct version from the CD media, as someone mentioned above. But first double-check your infrastructure settings before applying any of these changes:

a. DNS settings

b. /etc/hosts entries of the specified servers

c. network connectivity

d. basic "service mgmt-vmware restart" and "service vmware-vpxa restart" commands

e. try to remove the host from the cluster and re-enable HA.

f. reboot your host if needed. (Then you can use the techniques above.)
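
Item b is easy to script. Here is a minimal sketch, assuming a hypothetical helper named hosts_has_entry; the host names in the comments are made up. It only checks that a name appears on a non-comment line of a hosts file, not that the address is correct.

```shell
#!/bin/sh
# Hypothetical helper: succeed if host name $1 appears as a name on any
# non-comment line of the hosts file $2 (defaults to /etc/hosts).
hosts_has_entry() {
    awk -v h="$1" '
        { sub(/#.*/, "")                      # drop comments
          for (i = 2; i <= NF; i++)           # field 1 is the IP address
              if ($i == h) found = 1 }
        END { exit !found }
    ' "${2:-/etc/hosts}"
}

# Example on the service console (names are placeholders):
#   hosts_has_entry esx01.example.local || echo "esx01 missing from /etc/hosts"
```

Run it on each ESX host and on the VirtualCenter side for every name the hosts need to resolve.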

Hope this helps!

If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!! Regards, Stefan Nguyen VMware vExpert 2009 iGeek Systems Inc. VMware vExpert, VCP 3 & 4, VSP, VTSP, CCA, CCEA, CCNA, MCSA, EMCSE, EMCISA

taberj
Contributor

I took the recommendation to remove vpxa and then reconnect:

rpm -e VMware-vpxa

Stopping vmware-vpxa:[ OK ]

warning: /etc/opt/vmware/vpxa/vpxa.cfg saved as /etc/opt/vmware/vpxa/vpxa.cfg.rpmsave

This worked for me.

fscked
Contributor

I am still having this issue after trying all of these suggestions.

I have VC 2.5 and a cluster. In that cluster I have a 3.0.1 box that works fine. I have another box that was 3.0.1 and didn't work after all these suggestions. I tried upgrading it to 3.5, and it is still giving me the "Unable to access the specified host. It either does not exist, the server software is not responding, or there is a network problem." error.

Any suggestions?
