I currently have a fairly large configuration, but I have also seen similar issues with slightly smaller ones. The current setup is a 12-node cluster running ESXi 6 and vCenter 6, with DRS and HA enabled and "should" rules for every VM. It hosts about 2000 VMs.
I'm seeing what appear to be extremely long times to collect and process the available resources across the cluster. From the vpxd log:
vpxd-13.log:2016-06-22T10:19:32.471-04:00 warning vpxd[08328] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-45] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 2739 ms
vpxd-13.log:2016-06-22T10:20:05.283-04:00 warning vpxd[08592] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-7d] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 3938 ms
vpxd-13.log:2016-06-22T10:20:44.290-04:00 warning vpxd[09268] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-3e] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 6140 ms
vpxd-13.log:2016-06-22T10:21:54.128-04:00 warning vpxd[07136] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-d9] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 15938 ms
vpxd-13.log:2016-06-22T10:22:44.581-04:00 warning vpxd[02100] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-d0] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 4271 ms
vpxd-13.log:2016-06-22T10:23:46.249-04:00 warning vpxd[09148] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-5e] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 3618 ms
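To see how often the DRS region runs long, the timings can be pulled out of the vpxd logs with a short script. This is only a minimal sketch: it assumes the line layout matches the excerpts above, and the function names and the 10-second threshold are illustrative choices of mine, not anything defined by vCenter.

```python
import re
from datetime import datetime

# Matches VpxProfiler warnings of the form quoted above. The layout is
# assumed from those excerpts and may differ in other vCenter builds.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\S*\s+warning\s+vpxd\[\w+\].*?"
    r"Time taken for DRS region \[(?P<region>[^\]]+)\] took (?P<ms>\d+) ms"
)


def parse_drs_timings(lines):
    """Return a list of (timestamp, region, milliseconds) for matching lines."""
    timings = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
            timings.append((ts, match.group("region"), int(match.group("ms"))))
    return timings


def slow_invocations(timings, threshold_ms=10000):
    """Keep only entries at or above the (hypothetical) threshold."""
    return [entry for entry in timings if entry[2] >= threshold_ms]


# Three of the lines from the post, used as sample input.
sample = [
    "vpxd-13.log:2016-06-22T10:19:32.471-04:00 warning vpxd[08328] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-45] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 2739 ms",
    "vpxd-13.log:2016-06-22T10:21:54.128-04:00 warning vpxd[07136] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-d9] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 15938 ms",
    "vpxd-13.log:2016-06-22T10:22:44.581-04:00 warning vpxd[02100] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-d0] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 4271 ms",
]

timings = parse_drs_timings(sample)
slow = slow_invocations(timings)

if __name__ == "__main__":
    for ts, region, ms in slow:
        print(ts.isoformat(), region, ms)
```

Feeding it the full vpxd-*.log files instead of the sample list gives the timestamps of the slow invocations, which can then be lined up against the other warnings and the hostd cores.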
I'm also occasionally seeing the following, especially during the delays of well over 10 seconds:
warning vpxd[02128] [Originator@6876 sub=drmLogger opID=task-internal-1-7dc43530-e3] [VpxdDrmInterface::AskAndRefreshForDrmRecommendationsDo] At least 50% of cluster hosts (12 in total) are not reporting quickStats. Switching to alloc-only invocation.
I believe the above message is accompanied by an alert in vCenter for the cluster: "DRS Invocation Not Completed". When I see this message, DRS stops entirely and I need to, at a minimum, restart vCenter to get DRS to start migrating again.
I have also seen inconsistent cluster stats via PowerCLI:
PowerCLI C:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI> Get-Cluster -name Dozen | get-view| %{$_.Summary}
CurrentFailoverLevel : 0
AdmissionControlInfo : VMware.Vim.ClusterFailoverLevelAdmissionControlInfo
NumVmotions : 6985
TargetBalance : -1000
CurrentBalance : -1000
UsageSummary : VMware.Vim.ClusterUsageSummary
CurrentEVCModeKey : intel-haswell
DasData : VMware.Vim.ClusterDasDataSummary
TotalCpu : 855512
TotalMemory : 4944999247872
NumCpuCores : 336
NumCpuThreads : 672
EffectiveCpu : 767622
EffectiveMemory : 4542453
NumHosts : 12
NumEffectiveHosts : 12
OverallStatus : red
PowerCLI C:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI> Get-Cluster -name Dozen | get-view| %{$_.Summary.UsageSummary}
TotalCpuCapacityMhz : 0
TotalMemCapacityMB : 0
CpuReservationMhz : 0
MemReservationMB : 0
PoweredOffCpuReservationMhz : 0
PoweredOffMemReservationMB : 0
CpuDemandMhz : 0
MemDemandMB : 0
StatsGenNumber : -1
CpuEntitledMhz : 0
MemEntitledMB : 0
PoweredOffVmCount : 0
TotalVmCount : 0
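To catch this state automatically, the console output can be saved as text and checked for the same pattern. The sketch below is a rough heuristic under my own assumption: that a StatsGenNumber of -1 together with zero capacity and a zero VM count means the usage summary was never populated. That interpretation is a guess from the output above, not documented behavior, and both function names are hypothetical.

```python
def parse_property_block(text):
    """Parse 'Name : Value' lines, as printed by PowerCLI, into a dict of strings."""
    props = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            props[name.strip()] = value.strip()
    return props


def looks_unpopulated(usage):
    """Heuristic (assumption): no stats generation yet and all-zero totals."""
    return (
        usage.get("StatsGenNumber") == "-1"
        and usage.get("TotalCpuCapacityMhz") == "0"
        and usage.get("TotalMemCapacityMB") == "0"
        and usage.get("TotalVmCount") == "0"
    )


# The UsageSummary output from the post, used as sample input.
SAMPLE = """\
TotalCpuCapacityMhz : 0
TotalMemCapacityMB : 0
CpuReservationMhz : 0
MemReservationMB : 0
PoweredOffCpuReservationMhz : 0
PoweredOffMemReservationMB : 0
CpuDemandMhz : 0
MemDemandMB : 0
StatsGenNumber : -1
CpuEntitledMhz : 0
MemEntitledMB : 0
PoweredOffVmCount : 0
TotalVmCount : 0
"""

usage = parse_property_block(SAMPLE)

if __name__ == "__main__":
    print("usage summary looks unpopulated:", looks_unpopulated(usage))
```

Running the PowerCLI command on a schedule and passing its output through a check like this would give a timeline of when the summary goes empty, to compare against the DRS warnings.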
Additionally, possibly related, I have issues powering on VMs when in this state. I presume this is due to DRS getting wedged and preventing power-on.
I'd appreciate any analysis folks could offer, including any additional places I could look to track down the source of the slow DRS response times, as I feel that's a sure sign of trouble. Oh, I also get occasional hostd cores. So there are a lot of things I'm trying to correlate.
Thanks,
Tyler
I'm seeing similar things here:
2016-08-16T14:15:24.328Z warning vpxd[7FAA976ED700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-313dc3b7] [DRS]NotifyDeviceModelChanges [TotalTime] took 13205 ms
2016-08-16T14:15:24.330Z warning vpxd[7FAA626CD700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-a23e30] [DRS]NotifyDeviceModelChanges [TotalTime] took 13077 ms
2016-08-16T14:15:24.383Z warning vpxd[7FAAD6949700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-6c566149] [DRS]NotifyDeviceModelChanges [TotalTime] took 13134 ms
2016-08-16T14:15:24.385Z warning vpxd[7FAAA97AF700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-5708c280] [DRS]NotifyDeviceModelChanges [TotalTime] took 13127 ms
I have no idea why this happens, but DRS seems to be working correctly so far.
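To check whether the slow NotifyDeviceModelChanges calls come from one host or from many, the warnings can be grouped by the host ID in the opID. This is a sketch only: it assumes the "HB-host-1696" part of the opID identifies the host whose heartbeat triggered the call, which is my reading of the log lines above rather than something I can confirm, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Layout assumed from the warnings quoted above.
HB_RE = re.compile(
    r"opID=HB-(?P<host>host-\d+)@\S*\]\s+"
    r"\[DRS\]NotifyDeviceModelChanges \[TotalTime\] took (?P<ms>\d+) ms"
)


def summarize_by_host(lines):
    """Return {host_id: (warning_count, max_ms)} for matching lines."""
    durations = defaultdict(list)
    for line in lines:
        match = HB_RE.search(line)
        if match:
            durations[match.group("host")].append(int(match.group("ms")))
    return {host: (len(ms), max(ms)) for host, ms in durations.items()}


# Two of the lines from the post, used as sample input.
sample = [
    "2016-08-16T14:15:24.328Z warning vpxd[7FAA976ED700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-313dc3b7] [DRS]NotifyDeviceModelChanges [TotalTime] took 13205 ms",
    "2016-08-16T14:15:24.330Z warning vpxd[7FAA626CD700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-a23e30] [DRS]NotifyDeviceModelChanges [TotalTime] took 13077 ms",
]

summary = summarize_by_host(sample)

if __name__ == "__main__":
    for host, (count, worst) in sorted(summary.items()):
        print(host, count, worst)
```

If a single host ID dominates the summary across a full log, that host is a reasonable place to start looking.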