VMware Cloud Community
Tyler14
Contributor

Extremely long DRS Refresh Recommendation times, with occasional DRS failures

I currently have a fairly large configuration, though I have seen similar issues on slightly smaller ones. The current setup is a 12-node cluster running ESXi 6 and vCenter 6, with DRS and HA enabled, "should" rules for every VM, and about 2000 VMs.

From what I can gather, the cluster is taking an extremely long time to collect and process the available resources. From the vpxd log:

vpxd-13.log:2016-06-22T10:19:32.471-04:00 warning vpxd[08328] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-45] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 2739 ms

vpxd-13.log:2016-06-22T10:20:05.283-04:00 warning vpxd[08592] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-7d] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 3938 ms

vpxd-13.log:2016-06-22T10:20:44.290-04:00 warning vpxd[09268] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-3e] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 6140 ms

vpxd-13.log:2016-06-22T10:21:54.128-04:00 warning vpxd[07136] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-d9] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 15938 ms

vpxd-13.log:2016-06-22T10:22:44.581-04:00 warning vpxd[02100] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-d0] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 4271 ms

vpxd-13.log:2016-06-22T10:23:46.249-04:00 warning vpxd[09148] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-5e] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 3618 ms
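
To put numbers on warnings like these across a whole log, here is a minimal Python sketch that extracts the region name and duration from each line and flags the slow ones. The regular expression is inferred only from the lines quoted above, and the 10-second cut-off, the function names, and the `SAMPLE` list are illustrative choices, not anything defined by vCenter.

```python
import re
from statistics import mean

# Pattern inferred from the warnings quoted above, not from a documented log format.
DRS_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+)\S*\s+warning\b.*?"
    r"Time taken for DRS region \[(?P<region>\w+)\] took (?P<ms>\d+) ms"
)


def drs_timings(lines):
    """Return (timestamp, region, milliseconds) for every matching log line."""
    found = []
    for line in lines:
        m = DRS_LINE.search(line)
        if m:
            found.append((m.group("ts"), m.group("region"), int(m.group("ms"))))
    return found


def summarize(timings, slow_ms=10_000):
    """Summarize durations; slow_ms is an arbitrary threshold for illustration."""
    durations = [ms for _, _, ms in timings]
    return {
        "count": len(durations),
        "mean_ms": mean(durations) if durations else None,
        "max_ms": max(durations, default=None),
        "slow": [(ts, ms) for ts, _, ms in timings if ms >= slow_ms],
    }


# Three of the lines from the post, copied verbatim.
SAMPLE = [
    "vpxd-13.log:2016-06-22T10:19:32.471-04:00 warning vpxd[08328] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-45] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 2739 ms",
    "vpxd-13.log:2016-06-22T10:21:54.128-04:00 warning vpxd[07136] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-d9] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 15938 ms",
    "vpxd-13.log:2016-06-22T10:22:44.581-04:00 warning vpxd[02100] [Originator@6876 sub=VpxProfiler opID=task-internal-1-7dc43530-d0] Time taken for DRS region [AskAndRefreshDrmRecommendations] took 4271 ms",
]

if __name__ == "__main__":
    print(summarize(drs_timings(SAMPLE)))
```

Feeding it the full log (for example, the lines of an open file handle) instead of `SAMPLE` would give the timestamps of the slow invocations, which can then be lined up against other events such as the hostd cores.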


I'm also occasionally seeing the following, especially during delays of well over 10 seconds:

warning vpxd[02128] [Originator@6876 sub=drmLogger opID=task-internal-1-7dc43530-e3] [VpxdDrmInterface::AskAndRefreshForDrmRecommendationsDo] At least 50% of cluster hosts (12 in total) are not reporting quickStats.  Switching to alloc-only invocation.

I believe the above message is accompanied by a vCenter alert on the cluster: "DRS Invocation Not Completed". When I see this message, DRS stops entirely, and I have to restart vCenter, at a minimum, to get DRS migrating again.

I have also seen inconsistent cluster stats via PowerCLI:

PowerCLI C:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI> Get-Cluster -name Dozen | get-view| %{$_.Summary}

CurrentFailoverLevel : 0
AdmissionControlInfo : VMware.Vim.ClusterFailoverLevelAdmissionControlInfo
NumVmotions          : 6985
TargetBalance        : -1000
CurrentBalance       : -1000
UsageSummary         : VMware.Vim.ClusterUsageSummary
CurrentEVCModeKey    : intel-haswell
DasData              : VMware.Vim.ClusterDasDataSummary
TotalCpu             : 855512
TotalMemory          : 4944999247872
NumCpuCores          : 336
NumCpuThreads        : 672
EffectiveCpu         : 767622
EffectiveMemory      : 4542453
NumHosts             : 12
NumEffectiveHosts    : 12
OverallStatus        : red

PowerCLI C:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI> Get-Cluster -name Dozen | get-view| %{$_.Summary.UsageSummary}

TotalCpuCapacityMhz         : 0
TotalMemCapacityMB          : 0
CpuReservationMhz           : 0
MemReservationMB            : 0
PoweredOffCpuReservationMhz : 0
PoweredOffMemReservationMB  : 0
CpuDemandMhz                : 0
MemDemandMB                 : 0
StatsGenNumber              : -1
CpuEntitledMhz              : 0
MemEntitledMB               : 0
PoweredOffVmCount           : 0
TotalVmCount                : 0
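
Two things in this output are easy to misread, so here is a small Python sketch that checks them on the pasted text. First, in the vSphere API `TotalMemory` is reported in bytes while `EffectiveMemory` is in MB, so the two figures are consistent once converted. Second, the sketch flags a usage summary in which every counter is 0 and `StatsGenNumber` is -1. Treating that combination as "never populated" is my assumption, not a documented meaning, and the function names are hypothetical.

```python
def parse_props(text):
    """Parse 'Name : Value' lines, as printed by PowerCLI, into a dict of strings."""
    props = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            props[name.strip()] = value.strip()
    return props


def effective_memory_ratio(summary):
    """Compare EffectiveMemory (MB) with TotalMemory (bytes) after converting units."""
    total_mb = int(summary["TotalMemory"]) / 2**20
    return int(summary["EffectiveMemory"]) / total_mb


def looks_unpopulated(usage):
    """Heuristic (an assumption): StatsGenNumber of -1 and every other counter at 0."""
    others = [v for k, v in usage.items() if k != "StatsGenNumber"]
    return usage.get("StatsGenNumber") == "-1" and bool(others) and all(v == "0" for v in others)


# Values copied from the output in the post (subset of the properties).
SUMMARY_TEXT = """
TotalMemory          : 4944999247872
EffectiveMemory      : 4542453
NumHosts             : 12
"""

USAGE_TEXT = """
TotalCpuCapacityMhz         : 0
TotalMemCapacityMB          : 0
CpuDemandMhz                : 0
MemDemandMB                 : 0
StatsGenNumber              : -1
TotalVmCount                : 0
"""

if __name__ == "__main__":
    summary = parse_props(SUMMARY_TEXT)
    usage = parse_props(USAGE_TEXT)
    print("effective/total memory:", round(effective_memory_ratio(summary), 3))
    print("usage summary looks unpopulated:", looks_unpopulated(usage))
```

On these numbers the cluster summary itself looks sane (effective memory is a little under total), which suggests the zeros are confined to the DRS usage summary rather than the host inventory.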

Additionally, and possibly related, I have trouble powering on VMs while the cluster is in this state. I presume DRS gets wedged and blocks the power-on.

I'd appreciate any analysis folks could offer, including any other places I could look to track down the source of the slow DRS response times, as I feel they're a sure sign of trouble. Oh, I also get occasional hostd core dumps, so there are a lot of things I'm trying to correlate.

Thanks,

Tyler

1 Reply
internetrush
Contributor

I'm seeing similar things here:

2016-08-16T14:15:24.328Z warning vpxd[7FAA976ED700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-313dc3b7] [DRS]NotifyDeviceModelChanges [TotalTime] took 13205 ms

2016-08-16T14:15:24.330Z warning vpxd[7FAA626CD700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-a23e30] [DRS]NotifyDeviceModelChanges [TotalTime] took 13077 ms

2016-08-16T14:15:24.383Z warning vpxd[7FAAD6949700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-6c566149] [DRS]NotifyDeviceModelChanges [TotalTime] took 13134 ms

2016-08-16T14:15:24.385Z warning vpxd[7FAAA97AF700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-5708c280] [DRS]NotifyDeviceModelChanges [TotalTime] took 13127 ms
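
One way to read these four lines is to work out when each operation began. The Python sketch below subtracts the reported duration from the log timestamp, assuming (and this is only an assumption) that the timestamp marks the moment the operation finished. The regular expression is inferred from the lines above, and the names are illustrative.

```python
import re
from datetime import datetime, timedelta

# Pattern inferred from the four lines quoted above, not from a documented format.
LINE = re.compile(
    r"^(?P<ts>\S+)Z warning .*?opID=HB-(?P<host>host-\d+)@.*?"
    r"\[TotalTime\] took (?P<ms>\d+) ms"
)


def spans(lines):
    """Return (host, start, end) per line, assuming the timestamp is the end time."""
    result = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        end = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        start = end - timedelta(milliseconds=int(m.group("ms")))
        result.append((m.group("host"), start, end))
    return result


# The four lines from the reply, copied verbatim.
SAMPLE = """
2016-08-16T14:15:24.328Z warning vpxd[7FAA976ED700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-313dc3b7] [DRS]NotifyDeviceModelChanges [TotalTime] took 13205 ms
2016-08-16T14:15:24.330Z warning vpxd[7FAA626CD700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-a23e30] [DRS]NotifyDeviceModelChanges [TotalTime] took 13077 ms
2016-08-16T14:15:24.383Z warning vpxd[7FAAD6949700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-6c566149] [DRS]NotifyDeviceModelChanges [TotalTime] took 13134 ms
2016-08-16T14:15:24.385Z warning vpxd[7FAAA97AF700] [Originator@6876 sub=VpxProfiler opID=HB-host-1696@11319-6da279df-[DRS]NotifyDeviceModelChanges-5708c280] [DRS]NotifyDeviceModelChanges [TotalTime] took 13127 ms
""".strip().splitlines()

if __name__ == "__main__":
    for host, start, end in spans(SAMPLE):
        print(host, start.time(), "->", end.time())
```

Under that assumption, all four operations for host-1696 began and ended within a fraction of a second of each other, which would fit a single stall of roughly 13 seconds holding up several threads rather than four independent slow operations.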

I have no idea why this happens, but DRS seems to be working correctly so far.
