Extremely long DRS Refresh Recommendation times, with occasional DRS failures

Tyler14 · 06-22-2016

I currently have a fairly large configuration, but I have also seen similar issues with slightly smaller ones. The current config is a 12-node cluster running ESXi 6 and vCenter 6. DRS and HA are enabled, with "should" rules for every VM, and there are about 2000 VMs.

I'm seeing what appear to be extremely long times to collect and process the available resources across the cluster. From the vpxd log:

vpxd-13.log:2016-06-22T10:19:32.471-04:00 warning vpxd[08328] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-45] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 2739 ms

vpxd-13.log:2016-06-22T10:20:05.283-04:00 warning vpxd[08592] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-7d] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 3938 ms

vpxd-13.log:2016-06-22T10:20:44.290-04:00 warning vpxd[09268] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-3e] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 6140 ms

vpxd-13.log:2016-06-22T10:21:54.128-04:00 warning vpxd[07136] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-d9] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 15938 ms

vpxd-13.log:2016-06-22T10:22:44.581-04:00 warning vpxd[02100] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-d0] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 4271 ms

vpxd-13.log:2016-06-22T10:23:46.249-04:00 warning vpxd[09148] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-5e] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 3618 ms

I'm also occasionally seeing the following, especially during the delays of more than 10 seconds:

warning vpxd[02128] [Originator@6876 sub=drmLogger opID=task-internal-1-7dc43530-e3] [VpxdDrmInterface::AskAndRefreshForDrmRecommendationsDo] At least 50% of cluster hosts (12 in total) are not reporting quickStats. Switching to alloc-only invocation.

I believe the above message is accompanied by an alert in vCenter for the cluster: "DRS Invocation Not Completed". When I see this message, DRS stops entirely and I need to, at a minimum, restart vCenter to get DRS to start migrating again.

I have also seen inconsistent cluster stats via PowerCLI:

PowerCLI C:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI> Get-Cluster -name Dozen | get-view| %{$_.Summary}

CurrentFailoverLevel : 0

AdmissionControlInfo : VMware.Vim.ClusterFailoverLevelAdmissionControlInfo

NumVmotions : 6985

TargetBalance : -1000

CurrentBalance : -1000

UsageSummary : VMware.Vim.ClusterUsageSummary

CurrentEVCModeKey : intel-haswell

DasData : VMware.Vim.ClusterDasDataSummary

TotalCpu : 855512

TotalMemory : 4944999247872

NumCpuCores : 336

NumCpuThreads : 672

EffectiveCpu : 767622

EffectiveMemory : 4542453

NumHosts : 12

NumEffectiveHosts : 12

OverallStatus : red

PowerCLI C:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI> Get-Cluster -name Dozen | get-view| %{$_.Summary.UsageSummary}

TotalCpuCapacityMhz : 0

TotalMemCapacityMB : 0

CpuReservationMhz : 0

MemReservationMB : 0

PoweredOffCpuReservationMhz : 0

PoweredOffMemReservationMB : 0

CpuDemandMhz : 0

MemDemandMB : 0

StatsGenNumber : -1

CpuEntitledMhz : 0

MemEntitledMB : 0

PoweredOffVmCount : 0

TotalVmCount : 0

Additionally, and possibly related, I have issues powering on VMs when the cluster is in this state. I presume this is because DRS is wedged and is blocking power-on.

I'd appreciate any analysis folks could offer, including any additional places I could look to track down the source of the DRS response times, as I feel those are a sure sign of trouble. Oh, I also get occasional hostd cores. So there are a lot of things I'm trying to correlate.
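As a starting point for that correlation, here is a minimal Python sketch that pulls the DRS region timings out of vpxd log lines so the slow invocations can be lined up by timestamp against other events (the quickStats warning, hostd cores, failed power-ons). It assumes the line format shown in the excerpts above. The function names and the 10000 ms threshold are my own choices for illustration, not anything defined by vCenter.

```python
import re

# Matches VpxProfiler lines of the form seen in the excerpts, e.g.
#   <timestamp> warning vpxd[08328] [...] Time taken for DRS region [Name] took 2739 ms
# A leading "vpxd-13.log:" prefix (as added by grep) is skipped because we use search().
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+\S*)\s+warning\s+vpxd\[\w+\].*?"
    r"Time taken for DRS region \[(?P<region>\w+)\] took (?P<ms>\d+) ms"
)


def drs_region_times(lines):
    """Return a list of (timestamp, region, milliseconds) for each matching line."""
    results = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            results.append((m.group("ts"), m.group("region"), int(m.group("ms"))))
    return results


def slow_invocations(lines, threshold_ms=10000):
    """Keep only entries at or above threshold_ms (an arbitrary cutoff for this sketch)."""
    return [entry for entry in drs_region_times(lines) if entry[2] >= threshold_ms]
```

Run against the six lines quoted earlier, `slow_invocations` would keep only the 15938 ms entry at 10:21:54, which gives a timestamp to search for in the hostd and drmLogger output.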

Thanks,

Tyler

internetrush · 08-16-2016

I'm seeing similar things here:

2016-08-16T14:15:24.328Z warning vpxd[7FAA976ED700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-313dc3b7] [DRS]NotifyDeviceModelChanges [TotalTime] took 13205 ms

2016-08-16T14:15:24.330Z warning vpxd[7FAA626CD700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-a23e30] [DRS]NotifyDeviceModelChanges [TotalTime] took 13077 ms

2016-08-16T14:15:24.383Z warning vpxd[7FAAD6949700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-6c566149] [DRS]NotifyDeviceModelChanges [TotalTime] took 13134 ms

2016-08-16T14:15:24.385Z warning vpxd[7FAAA97AF700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-5708c280] [DRS]NotifyDeviceModelChanges [TotalTime] took 13127 ms

I have no idea why this happens, but DRS seems to be working correctly so far.
