8 Replies Latest reply on Nov 8, 2012 6:37 PM by mschau

    vim.fault.NoCompatibleHost N3Vim5Fault16NoCompatibleHostE

    mschau Novice

      Had a full power outage at a customer datacenter. Power was restored, but HA did not automatically start any VMs after the hosts came up.

      I don't understand why HA failed, and googling some of the repeated errors returned troublingly little.

      I took a look at fdm.log and I see that it continuously failed to find suitable hosts for the VMs for over an hour. The VMs were eventually powered on manually when the sysadmin arrived at the DC, over an hour after the environment powered up.

      Here's a snippet where it decides there are 65 VMs that need to be powered on, but then immediately starts faulting them:

      2012-11-03T15:12:15.573Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::GenerateFailoverRecommendation] 65 vms added to domain config
      2012-11-03T15:12:15.573Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsMultiplePasses] Pass2: respect host preference but not failover hosts
      2012-11-03T15:12:15.573Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsAlgorithmForPlacement] Calling mapVm to place 65 Vms

      It then immediately faults all 65 with the same error:

      2012-11-03T15:12:15.576Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] DrmFault: reason powerOnVm, vm /vmfs/volumes/4ac2492d-8aa9fb29-d449-001517a6c248/SOMEVM/SOMEVM.vmx, host host-231, fault [N3Vim5Fault16NoCompatibleHostE:0x5b08b88]
      2012-11-03T15:12:15.576Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] FaultArgument: none

      ...and so on through all 65.

      Then it tries another pass:

      2012-11-03T15:12:15.581Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsMultiplePasses] Pass3: use all compatible hosts
      2012-11-03T15:12:15.581Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsAlgorithmForPlacement] Calling mapVm to place 65 Vms
      2012-11-03T15:12:15.584Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] DrmFault: reason powerOnVm, vm /vmfs/volumes/4ac2492d-8aa9fb29-d449-001517a6c248/SOMEVM/SOMEVM.vmx, host host-231, fault [N3Vim5Fault16NoCompatibleHostE:0x5b00348]
      2012-11-03T15:12:15.584Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] FaultArgument: none

      After that pass, it switches to the vim.fault.NoCompatibleHost error:

      2012-11-03T15:12:15.587Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [PlacementManagerImpl::PlacementUpdateCb] No recommendation is generated
      2012-11-03T15:12:15.587Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [PlacementManagerImpl::HandleNotPlacedVms] Reset Vm /vmfs/volumes/4ac2492d-8aa9fb29-d449-001517a6c248/SOMEVM/SOMEVM.vmx, vim.fault.NoCompatibleHost

       

      Then it just outputs the following for almost two hours:

      2012-11-03T15:12:37.824Z [FFE89B90 verbose 'Placement'] [RR::ResetVms] Reset 0 Vms. Records = 65
      2012-11-03T15:12:37.824Z [FFE89B90 info 'Placement'] [RR::CreatePlacementRequest] 65 total VM with some excluded: 0 VM disabled; 0 VM being placed; 65 VM waiting resources; 0 VM in time delay;

      ....

      2012-11-03T16:57:37.931Z [70B13B90 info 'Placement'] [RR::CreatePlacementRequest] 65 total VM with some excluded: 0 VM disabled; 0 VM being placed; 65 VM waiting resources; 0 VM in time delay;
      2012-11-03T16:58:37.933Z [FFB9C460 verbose 'Placement'] [RR::ResetVms] Reset 0 Vms. Records = 65

       

      At this point one of the sysadmins arrived at the DC and started to power on the VMs manually; they started up with no problems.

       

      So what gives?  Why did HA fail so badly?  If there were no compatible hosts, why could the sysadmin turn the VMs on without any problems?

      (At first I thought it might be datastore access, but the admin didn't have to do anything; the datastores were all listed when he connected directly to the ESX host.)

      All the hosts were up. The HA cluster config is as follows:

      "Enable: Disallow VM power on operations that violate availability constraints"

      "Percentage of cluster resources reserved as failover spare capacity: 50% CPU, 50% Memory"

      All other options are default.

      Since it was a full power-down situation, there were no running VMs, just ESX hosts (this includes vCenter being down).

       

      So my question is: has anyone else seen this?  Do you know why it happened?

        • 1. Re: vim.fault.NoCompatibleHost N3Vim5Fault16NoCompatibleHostE
          depping Champion
          User Moderator, VMware Employee

          I don't know why it failed, but the error messages indicate (as you stated) that HA is under the impression that none of the hosts are compatible with the VMs. In your case the VMs could be powered on manually, so HA was doing something weird.

           

          I suggest you contact support and ask them to pinpoint the problem so you can prevent it in the future.

          • 2. Re: vim.fault.NoCompatibleHost N3Vim5Fault16NoCompatibleHostE
            depping Champion
            User Moderator, VMware Employee

            It is almost like the compatibility list was not up to date for some weird reason...

             

            If you do log a ticket with support, could you post the SR number here so I can take a look and follow it? Thanks.

            • 3. Re: vim.fault.NoCompatibleHost N3Vim5Fault16NoCompatibleHostE
              depping Champion
              User Moderator, VMware Employee

              It is probably too late now, but they could have checked whether the compatibility list was okay from the command line:

               

              List all VMs and the hosts they are compatible with:

               

              5.1:

              ./prettyPrint.sh vmmetadata

               

              5.0:

              ./prettyPrint.sh compatlist
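              From memory (so treat the path as an assumption): the script lives under the FDM install directory on the host, something like /opt/vmware/fdm/fdm/, and only the argument differs by release. A small wrapper along these lines (a hypothetical helper, not part of the product) picks the right one:

```shell
# Hypothetical helper: map the ESXi release to the prettyPrint.sh argument
# that dumps the HA compatibility data (argument names taken from the two
# commands above; the install path in the usage comment is an assumption).
fdm_compat_arg() {
    case "$1" in
        5.1*) echo vmmetadata ;;
        5.0*) echo compatlist ;;
        *)    echo unknown ;;
    esac
}

# Example use over SSH on a cluster host (path is an assumption):
#   cd /opt/vmware/fdm/fdm && ./prettyPrint.sh "$(fdm_compat_arg 5.1)"
```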

              • 4. Re: vim.fault.NoCompatibleHost N3Vim5Fault16NoCompatibleHostE
                mschau Novice

                Thanks Duncan for the feedback, I'll keep that script in mind.

                 

                As another interesting facet of this: further discussion with the sysadmin and review of the logs shows that shortly after vCenter came up, HA started to work.

                 

                So, the timeline:

                The sysadmin connects and starts a single DC and the vCenter machines.

                After a short period, he then starts all of the VMs that were associated with that host (17 + 2 previous = 19).

                 

                A few minutes later, HA actually started to boot machines.

                Review of the logs shows that around the time vCenter fully booted, the host logged the following:

                 

                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ProcessRequestsLoop] Process a batch of 1 requests
                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE] Required number of AgentVm = 0
                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructClusterSnapshot] Start snapshotting.
                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructClusterSnapshot] Retrieve hosts.
                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructOneDrmHost] Successfully retrieve host,hostId = host-250, drmId = 0, UnreservedMem = 123749MB, UnreservedCpu = 37305MHz
                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructOneDrmHost] Successfully retrieve host,hostId = host-231, drmId = 1, UnreservedMem = 125378MB, UnreservedCpu = 37305MHz
                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructOneDrmHost] Successfully retrieve host,hostId = __localhost__, drmId = 2, UnreservedMem = 125374MB, UnreservedCpu = 37305MHz
                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructOneDrmHost] Successfully retrieve host,hostId = host-235, drmId = 3, UnreservedMem = 125380MB, UnreservedCpu = 37305MHz
                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ConstructClusterSnapshot] Retrieve RPs.
                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::RetrieveDrmResourcePools] Pool for root: UnreservedCpu = 149220Mhz, UnreservedMem = 499881Mb
                2012-11-03T17:07:37.949Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [PlacementManagerImpl::GetPendingPlacementExecution] 0 Vms are being placed
                2012-11-03T17:07:37.950Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::RetrieveDrmVmOnHost] Retrieved 0 powered-on Vms for Host host-231
                2012-11-03T17:07:37.950Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::RetrieveDrmVmOnHost] Retrieved 19 powered-on Vms for Host host-250
                2012-11-03T17:07:37.950Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::RetrieveDrmVmOnHost] Retrieved 0 powered-on Vms for Host __localhost__
                2012-11-03T17:07:37.950Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::RetrieveDrmVmOnHost] Retrieved 0 powered-on Vms for Host host-235
                2012-11-03T17:07:37.951Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::GenerateFailoverRecommendation] Snapshot is valid
                2012-11-03T17:07:37.953Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::ProcessRestartPriority] 46 Vms have priority 32
                2012-11-03T17:07:37.953Z [70BD6B90 verbose 'Placement' opID=SWI-9ae5a3a7] [DrmPE::GenerateFailoverRecommendation] 46 Vms are to be powered on

                Then it looks like it starts to recalculate the Compatibility Map for each VM:

                 

                2012-11-03T17:07:37.953Z [70BD6B90 verbose 'Invt' opID=SWI-9ae5a3a7] [InventoryManagerImpl::ComputeCompatMap] Compat list for /vmfs/volumes/4ce7071e-2704a142-6c48-001517c9195e/XXXXVM/XXXXVM.vmx: __localhost__, host-254, host-235, host-241, host-250, host-231

                 

                It then starts to boot the VMs.

                A scan of the log shows no 'ComputeCompatMap' prior to this (how often is this supposed to happen?). Does this function only run at the direction of vCenter?

                • 5. Re: vim.fault.NoCompatibleHost N3Vim5Fault16NoCompatibleHostE
                  Expert

                  vCenter Server computes the compat set ("which hosts can each VM run on") whenever there is a configuration change or when VMs power on or off. It then sends this data to FDM, and FDM persists it on local storage so that a cached copy is available even if vCenter is down. It looks like FDM didn't have an up-to-date copy of the compat set, or it was lost when all the hosts lost power. It would be great if you could file an SR so someone here can look at the logs to see what happened.
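                  For anyone who hits this later: that cached copy can be sanity-checked on a host. This is only a sketch under assumptions; the default FDM state directory (/etc/opt/vmware/fdm here) and the file names ("compatlist" on 5.0, "vmmetadata" on 5.1, per the prettyPrint.sh discussion above) may differ on your build:

```shell
# Sketch: report whether FDM has a non-empty persisted compat file.
# The default directory and the file names are assumptions (see above);
# pass a different directory as $1 if your build uses another location.
check_fdm_cache() {
    dir="${1:-/etc/opt/vmware/fdm}"
    for f in compatlist vmmetadata; do
        if [ -s "$dir/$f" ]; then
            echo "cached $f present"
            return 0
        fi
    done
    echo "no cached compat data found"
    return 1
}
```

                  If this reports nothing cached right after the hosts boot, FDM would be in exactly the state described: unable to place VMs until vCenter comes back and resends the compat set.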

                   

                  Elisha

                  • 7. Re: vim.fault.NoCompatibleHost N3Vim5Fault16NoCompatibleHostE
                    depping Champion
                    User Moderator, VMware Employee

                    Do you happen to know what went down first, your storage environment or your hosts?

                    • 8. Re: vim.fault.NoCompatibleHost N3Vim5Fault16NoCompatibleHostE
                      mschau Novice

                      I'll have to verify. If they didn't drop at the same time, then the hosts would have dropped before the SAN; I believe the VNX in use has a battery backup in addition (but not separate) to the UPS that powered the rest of the datacenter.