VMware Cloud Community
sjoerdhooft
Contributor

vMotion on disconnected host

Hi everyone,

We are in the middle of upgrading our ESX 4.1 Update 1 hosts to ESXi 5. Last night, one of the hosts stopped responding while Update Manager was scanning it for updates. The host is now disconnected from vCenter, but production VMs are still running on it. I already restarted the mgmt-vmware and vmware-vpxa agents; the restarts succeeded, but I still can't connect to the host. Does anyone know of a way to move the VMs to other hosts without downtime?

Please let me know!

Thanks in advance!

Sjoerd Hooft

9 Replies
a_p_
Leadership

Unfortunately you cannot live migrate a VM from a disconnected host, because the action is initiated from vCenter and therefore needs access to the involved hosts.

Do you see any warnings or errors when you try to reconnect the host?

André

sjoerdhooft
Contributor

Well, it says it's connecting, or actually retrieving vCenter agent data from the host, but it just stays like that and eventually times out. I also cannot connect to the host directly using the vSphere Client. I can log on using ssh, but then the host is also quite unresponsive: an ls -l in the /var/log directory can take up to a minute. I'm looking at the logs and see this error in the vmkernel log from the time it happened:

Nov 22 23:35:20 esxprd85 vmkernel: 49:09:31:12.891 cpu7:4348)NMP: nmp_CompleteCommandForPath: Command 0x12 (0x41027ef66340) to NMP device "mpx.vmhba5:C0:T0:L0" failed on physical path "vmhba5:C0:T0:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Nov 22 23:35:20 esxprd85 vmkernel: 49:09:31:12.892 cpu7:4348)ScsiDeviceIO: 1672: Command 0x12 to device "mpx.vmhba5:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
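
For anyone trying to read lines like these: H, D and P are the host, device and plugin status, and the three sense bytes are the sense key, ASC and ASCQ. Below is a minimal Python sketch for decoding them, assuming the log format shown above. The function name decode_scsi_failure is made up for illustration, and the lookup tables hold only a small subset of the standard SCSI codes, so anything not listed is returned as a raw hex value.

```python
import re

# Small, non-exhaustive subsets of standard SCSI codes (illustrative only).
SCSI_OPCODES = {0x00: "TEST UNIT READY", 0x12: "INQUIRY",
                0x28: "READ(10)", 0x2A: "WRITE(10)"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION", 0x8: "BUSY",
                 0x18: "RESERVATION CONFLICT"}
SENSE_KEYS = {0x0: "NO SENSE", 0x2: "NOT READY", 0x3: "MEDIUM ERROR",
              0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
              0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE",
            (0x24, 0x00): "INVALID FIELD IN CDB",
            (0x3A, 0x00): "MEDIUM NOT PRESENT"}

# Assumes the vmkernel line layout quoted in this thread.
LINE_RE = re.compile(
    r'Command (0x[0-9a-fA-F]+) .*?device "([^"]+)".*?'
    r'H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)'
    r'(?: Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+))?'
)


def decode_scsi_failure(line):
    """Return a dict describing a failed SCSI command, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    opcode, host, dev, plugin = (int(m.group(i), 16) for i in (1, 3, 4, 5))
    result = {
        "device": m.group(2),
        "command": SCSI_OPCODES.get(opcode, hex(opcode)),
        "host_status": host,
        "device_status": DEVICE_STATUS.get(dev, hex(dev)),
        "plugin_status": plugin,
    }
    if m.group(6) is not None:  # sense data is only present on some lines
        key, asc, ascq = (int(m.group(i), 16) for i in (6, 7, 8))
        result["sense_key"] = SENSE_KEYS.get(key, hex(key))
        result["additional_sense"] = ASC_ASCQ.get(
            (asc, ascq), "0x%02x/0x%02x" % (asc, ascq))
    return result
```

Fed either of the two lines above, this reports an INQUIRY (0x12) that came back with CHECK CONDITION, sense key ILLEGAL REQUEST, and additional sense INVALID FIELD IN CDB. In other words, the device rejected part of the inquiry request, which by itself does not point to a lost path or a failing disk.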

MKguy
Virtuoso

Have you only tried to manually restart the vCenter agent via /etc/init.d/vpxa restart or did you do a full services.sh restart?

Do the services crash again shortly after that or are they running?

Can you still connect directly to the host with the vSphere Client? If yes, then the vCenter agent might be broken.

Try a manual uninstall of the vCenter Agent and reconnect the host in vCenter again to push a fresh installation as described here:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100371...

Check esxtop for storage latency, check whether you have any issues on the SAN side, and check whether the vmkernel logs contain any messages indicating a LUN in an APD (All Paths Down) state. The ESX(i) hostd agent has a habit of becoming unusable if a storage device is unreachable: the agent keeps retrying the I/O indefinitely, eventually locking up the whole management interface.

See:

http://cormachogan.com/2012/09/07/vsphere-5-1-storage-enhancements-part-4-all-paths-down-apd/

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103098...

-- http://alpacapowered.wordpress.com
a_p_
Leadership

What is connected to "vmhba5"? You may want to check whether this is a hardware/storage (access) issue.

André

sjoerdhooft
Contributor

There is no services.sh on ESX 4.1, or at least I can't find it. I restarted vpxa, hostd, fdm and webaccess, since I found notices and warnings from these services in the logs. I can't connect directly either.

LUN 0 should be the boot LUN, and those are local disks. I can't imagine (but can't rule out either) that these would be overloaded, since there are no VMs on local storage.

MKguy
Virtuoso

Oh yeah, there is no services.sh in ESX classic; I misread your post and assumed ESXi.

Are you sure that vmhba5 is the local storage controller (it's usually vmhba0)? Run esxcfg-scsidevs -a and esxcfg-scsidevs -l -d mpx.vmhba5:C0:T0:L0 to identify it.

Also, the vmkernel messages you posted appear to be from yesterday. Unless they are still occurring, I doubt they are relevant to your problem. If you don't have any issues on the storage side, you could still try the vCenter agent reinstallation.
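
One way to check whether such messages are still occurring is to look at the most recent failure per device in a copy of the vmkernel log. The Python sketch below assumes the syslog-style line layout of the excerpt posted earlier and counts only the ScsiDeviceIO lines, so the matching NMP line for the same event is not counted twice. The function name summarize_failures is hypothetical, and the log path is whatever file you copied off the host.

```python
import re

# Assumes lines shaped like the excerpt in this thread:
# "Nov 22 23:35:20 <host> vmkernel: ... ScsiDeviceIO: ... device "<id>" failed ..."
FAIL_RE = re.compile(
    r'^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ vmkernel: '
    r'.*ScsiDeviceIO: .*device "([^"]+)" failed'
)


def summarize_failures(lines):
    """Map each device to (timestamp of its last failure line, failure count).

    Lines are assumed to be in chronological order, as in a normal log file,
    so the last matching line seen for a device is its most recent failure.
    """
    summary = {}
    for line in lines:
        m = FAIL_RE.match(line)
        if m:
            stamp, device = m.groups()
            count = summary.get(device, (None, 0))[1]
            summary[device] = (stamp, count + 1)
    return summary


# Example use against a copied log file (path is an assumption):
#   with open("vmkernel.log") as log:
#       for device, (last_seen, count) in summarize_failures(log).items():
#           print(device, last_seen, count)
```

If the last timestamp for a device is hours old, the messages for that device have stopped and are less likely to explain a host that is unresponsive right now.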

-- http://alpacapowered.wordpress.com
sjoerdhooft
Contributor

Turns out vmhba5 is the CD-ROM HBA; vmhba2 is the local disk.

The message I posted appeared at the time the issue occurred, which is why I thought it was relevant. Now that it's clear it comes from the CD-ROM, I'm not so sure anymore.

Is there a "restart all services" option on ESX? Or a list with the right order for the services to stop/start?

MKguy
Virtuoso

You already did it correctly on ESX by restarting the mgmt-vmware and vmware-vpxa services:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100349...

-- http://alpacapowered.wordpress.com
sjoerdhooft
Contributor

Well... the server just started responding again. My colleague moved some VMs by powering them off, and suddenly everything was OK again. Thanks, everyone, for taking the time to respond.
