mschau
Contributor

vim.fault.NoCompatibleHost N3Vim5Fault16NoCompatibleHostE

Had a full power outage at a customer datacenter. Power was restored, but HA did not automatically start any VMs after the hosts came back up.

I don't understand why HA failed, and googling some of the repeated errors turned up troublingly little.

I took a look at the fdm.log and I see that it continuously failed to find suitable 'hosts' for the VMs for over an hour. The VMs were eventually powered on manually when the sysadmin arrived at the DC, more than an hour after the environment came back up.

Here's a snippet of where it decides there are 65 VMs that need to be powered on, but then immediately starts to fault them:

2012-11-03T15:12:15.573Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::GenerateFailoverRecommendation] 65 vms added to domain config
2012-11-03T15:12:15.573Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsMultiplePasses] Pass2: respect host preference but not failover hosts
2012-11-03T15:12:15.573Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsAlgorithmForPlacement] Calling mapVm to place 65 Vms

It then immediately faults all 65 with the same error:

2012-11-03T15:12:15.576Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] DrmFault: reason powerOnVm, vm /vmfs/volumes/4ac2492d-8aa9fb29-d449-001517a6c248/SOMEVM/SOMEVM.vmx, host host-231, fault [N3Vim5Fault16NoCompatibleHostE:0x5b08b88]
2012-11-03T15:12:15.576Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] FaultArgument: none

...and so on, all the way through all 65.

Then it tries another pass:

2012-11-03T15:12:15.581Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsMultiplePasses] Pass3: use all compatible hosts
2012-11-03T15:12:15.581Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsAlgorithmForPlacement] Calling mapVm to place 65 Vms
2012-11-03T15:12:15.584Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] DrmFault: reason powerOnVm, vm /vmfs/volumes/4ac2492d-8aa9fb29-d449-001517a6c248/SOMEVM/SOMEVM.vmx, host host-231, fault [N3Vim5Fault16NoCompatibleHostE:0x5b00348]
2012-11-03T15:12:15.584Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] FaultArgument: none

After that pass, it switches to the vim.fault.NoCompatibleHost error:

2012-11-03T15:12:15.587Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [PlacementManagerImpl::PlacementUpdateCb] No recommendation is generated
2012-11-03T15:12:15.587Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [PlacementManagerImpl::HandleNotPlacedVms] Reset Vm /vmfs/volumes/4ac2492d-8aa9fb29-d449-001517a6c248/SOMEVM/SOMEVM.vmx, vim.fault.NoCompatibleHost

Then it just outputs the following for almost two hours:

2012-11-03T15:12:37.824Z [FFE89B90 verbose 'Placement'] [RR::ResetVms] Reset 0 Vms. Records = 65
2012-11-03T15:12:37.824Z [FFE89B90 info 'Placement'] [RR::CreatePlacementRequest] 65 total VM with some excluded: 0 VM disabled; 0 VM being placed; 65 VM waiting resources; 0 VM in time delay;

....

2012-11-03T16:57:37.931Z [70B13B90 info 'Placement'] [RR::CreatePlacementRequest] 65 total VM with some excluded: 0 VM disabled; 0 VM being placed; 65 VM waiting resources; 0 VM in time delay;
2012-11-03T16:58:37.933Z [FFB9C460 verbose 'Placement'] [RR::ResetVms] Reset 0 Vms. Records = 65
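Since the same DrmFault line repeats for every VM on every pass, a quick tally of fault types is an easy way to confirm the pattern across the whole log. Here is a rough sketch of such a tally; the regex is reverse-engineered from the log lines quoted above, not an official fdm.log parser.

```python
import re
from collections import Counter

# Matches DrmFault lines of the shape seen in the excerpts above (an
# assumption based on this thread, not a documented format).
FAULT_RE = re.compile(r"DrmFault: reason (\w+), vm (\S+), host (\S+), fault \[(\w+):")

def tally_faults(log_lines):
    """Count (reason, fault type) pairs across DrmFault log lines."""
    counts = Counter()
    for line in log_lines:
        m = FAULT_RE.search(line)
        if m:
            reason, _vm, _host, fault = m.groups()
            counts[(reason, fault)] += 1
    return counts

sample = [
    "2012-11-03T15:12:15.576Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] "
    "DrmFault: reason powerOnVm, vm /vmfs/volumes/abc/SOMEVM/SOMEVM.vmx, "
    "host host-231, fault [N3Vim5Fault16NoCompatibleHostE:0x5b08b88]",
]
print(tally_faults(sample))
```

Run against the real fdm.log, this would show 65 NoCompatibleHost faults per placement pass, matching the "Records = 65" lines above.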

At this point one of the sysadmins arrived at the DC and started powering on the VMs manually; they started up with no problem.

So what gives? Why did HA fail so badly? If there were no compatible hosts, why could the sysadmin simply power the VMs on without any problems?

(At first I thought maybe it was access to the datastores, but the admin didn't have to do anything; they were all listed when he connected directly to the ESX host.)

All the hosts were up. The HA cluster config is as follows:

"Enable: Disallow VM power on operations that violate availability constraints"

"Percentage of cluster resources reserved as failover spare capacity: 50% CPU, 50% Memory"

All other options are default.

Since it was a full power-down situation there were no running VMs, just ESX hosts (this includes vCenter being down).
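For what it's worth, a back-of-the-envelope check of the admission-control math, using the per-host numbers that show up later in the fdm.log (37305 MHz unreserved CPU and roughly 125 GB unreserved memory per host, four hosts in the snapshot), suggests capacity was not the blocker. This is only a sketch; real HA admission control is more involved than a flat percentage.

```python
def usable_after_failover_reserve(total_mhz, total_mb, reserve_pct):
    """Capacity left for placements once the failover percentage is held back."""
    keep = 1.0 - reserve_pct / 100.0
    return total_mhz * keep, total_mb * keep

hosts = 4
cpu_total = 37305 * hosts    # MHz, from the ConstructOneDrmHost log lines
mem_total = 125000 * hosts   # MB, approximate from the same lines

cpu_left, mem_left = usable_after_failover_reserve(cpu_total, mem_total, 50)
print(f"~{cpu_left:.0f} MHz CPU and ~{mem_left:.0f} MB RAM available for power-ons")
```

Even with 50% of CPU and memory held back, roughly 74 GHz and 250 GB remained for placements, which is why a capacity fault would have been surprising and NoCompatibleHost points elsewhere.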

So my question is: has anyone else seen this? Do you know why it happened?

8 Replies
depping
Leadership

I don't know why it failed, but the error messages (as you stated) indicate that HA is under the impression that none of the hosts are compatible with the VMs. In your case the VMs could be powered on, so HA was doing something weird.

I suggest you contact support and ask them to pinpoint the problem so you can prevent it.

depping
Leadership

It is almost as if the compatibility list was not up to date for some weird reason...

If you do log a ticket with support, could you post the SR number here so I can have a look and follow it? Thanks,

depping
Leadership

It is probably too late now, but they could have checked whether the compatibility list was okay through the command line.

To list all VMs and the hosts they are compatible with:

On 5.1:

./prettyPrint.sh vmmetadata

On 5.0:

./prettyPrint.sh compatlist

mschau
Contributor

Thanks, Duncan, for the feedback; I'll keep that script in mind.

As another 'interesting' facet of this: further discussion with the sysadmin and a review of the logs show that shortly after vCenter came up, HA started to work.

So, the timeline:

The sysadmin connects and starts a single DC and the vCenter machines.

After a short period he then starts all of the VMs that were associated with that host (17 + 2 previous = 19).

A few minutes later HA actually started to boot up machines.

Review of the logs shows that around the time vCenter fully booted the host logs the following:

2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ProcessRequestsLoop] Process a batch of 1 requests
2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE] Required number of AgentVm = 0
2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructClusterSnapshot] Start snapshotting.
2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructClusterSnapshot] Retrieve hosts.
2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructOneDrmHost] Successfully retrieve host,hostId = host-250, drmId = 0, UnreservedMem = 123749MB, UnreservedCpu = 37305MHz
2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructOneDrmHost] Successfully retrieve host,hostId = host-231, drmId = 1, UnreservedMem = 125378MB, UnreservedCpu = 37305MHz
2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructOneDrmHost] Successfully retrieve host,hostId = __localhost__, drmId = 2, UnreservedMem = 125374MB, UnreservedCpu = 37305MHz
2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructOneDrmHost] Successfully retrieve host,hostId = host-235, drmId = 3, UnreservedMem = 125380MB, UnreservedCpu = 37305MHz
2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructClusterSnapshot] Retrieve RPs.
2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::RetrieveDrmResourcePools] Pool for root: UnreservedCpu = 149220Mhz, UnreservedMem = 499881Mb
2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [PlacementManagerImpl::GetPendingPlacementExecution] 0 Vms are being placed
2012-11-03T17:07:37.950Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::RetrieveDrmVmOnHost] Retrieved 0 powered-on Vms for Host host-231
2012-11-03T17:07:37.950Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::RetrieveDrmVmOnHost] Retrieved 19 powered-on Vms for Host host-250
2012-11-03T17:07:37.950Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::RetrieveDrmVmOnHost] Retrieved 0 powered-on Vms for Host __localhost__
2012-11-03T17:07:37.950Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::RetrieveDrmVmOnHost] Retrieved 0 powered-on Vms for Host host-235
2012-11-03T17:07:37.951Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::GenerateFailoverRecommendation] Snapshot is valid
2012-11-03T17:07:37.953Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ProcessRestartPriority] 46 Vms have priority 32
2012-11-03T17:07:37.953Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::GenerateFailoverRecommendation] 46 Vms are to be powered on

Then it looks like it starts to recalculate the compatibility map for each VM:

2012-11-03T17:07:37.953Z [70BD6B90 verbose 'Invt' opID=SWI-9ae5a3a7] [InventoryManagerImpl::ComputeCompatMap] Compat list for /vmfs/volumes/4ce7071e-2704a142-6c48-001517c9195e/XXXXVM/XXXXVM.vmx: __localhost__, host-254, host-235, host-241, host-250, host-231

It then starts to boot the VMs.

A scan of the log shows no 'ComputeCompatMap' entries prior to this (how often is this supposed to happen?). Does this function only run at the direction of vCenter?

admin
Immortal

vCenter Server computes the compat set ("which hosts can each VM run on") whenever there is a configuration change or when VMs power on or off. It then sends this data to FDM, and FDM persists it on local storage so that a cached copy is available even if vCenter is down. It looks like FDM didn't have an up-to-date copy of the compat set, or it was lost when all the hosts lost power. It would be great if you could file an SR so someone here can look at the logs and see what happened.

Elisha
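The mechanism Elisha describes (vCenter computes per-VM compat sets, FDM caches them locally, placement consults the cache) can be modeled in a few lines to show why a lost cache produces exactly the symptoms in the log. All names here are purely illustrative, not real FDM or vCenter APIs.

```python
class FdmPlacement:
    """Toy model: FDM placement against a locally cached compat set."""

    def __init__(self):
        self.compat_cache = {}  # vm -> set of compatible host ids

    def update_from_vcenter(self, compat_sets):
        """vCenter pushes fresh compat sets on config/power changes."""
        self.compat_cache.update(compat_sets)

    def place(self, vm, live_hosts):
        """Pick any live host from the VM's cached compat set."""
        candidates = self.compat_cache.get(vm, set()) & set(live_hosts)
        if not candidates:
            return None  # analogous to vim.fault.NoCompatibleHost
        return sorted(candidates)[0]

fdm = FdmPlacement()
live = ["host-231", "host-235", "host-250"]

# Empty/stale cache after the outage: every placement faults,
# even though the hosts themselves are up.
assert fdm.place("SOMEVM.vmx", live) is None

# Once vCenter is back and repushes the compat map, placement succeeds.
fdm.update_from_vcenter({"SOMEVM.vmx": {"host-231", "host-250"}})
assert fdm.place("SOMEVM.vmx", live) == "host-231"
```

This matches the observed timeline: NoCompatibleHost for two hours with all hosts up, then successful power-ons minutes after vCenter booted and the ComputeCompatMap lines appeared.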

mschau
Contributor

SR: 12242973811

depping
Leadership

Do you happen to know which went down first, your storage environment or your hosts?

mschau
Contributor

I have to verify. If they didn't drop at the same time, then the hosts would have dropped before the SAN; I believe the VNX in use has a battery backup in addition (but not separate) to the UPS that powered the rest of the datacenter.
