VMware Cloud Community
jrmunday
Commander

Unusually high %RDY time on a fairly uncommitted host - Dell R815

Hi All,

I am seeing some very unusual %RDY values on my Dell R815 hosts and wondered if anyone else with the same (or similar) hardware has seen this. The R815s have 4x 16-core AMD Opteron(tm) 6276 processors (64 logical cores) and are running ESXi 5.0 build 1311175.

I'm always suspicious of these hosts, as the Dell hardware has been much less reliable than any of our HP hardware, but I want to stay open-minded about why this is happening and either explain or resolve it. In the past month the hosts have been upgraded to ESXi 5.0 build 1311175 and have had BIOS/firmware updates (so this could be a factor); they are currently running BIOS version 3.11.

The hosts where I see this behavior have very low resource allocations, so there is no overprovisioning that would explain it (generally fewer than 64 vCPUs in total, i.e. no more than the number of logical cores).

As an example, one host with only 4 VMs (16 vCPUs in total) shows excessive ready time on some of them. This doesn't happen all the time, but I would not expect spikes in %RDY (some above 40%) with such low utilisation.

Example host loads (I moved VMs off the first one for testing, but they are generally well balanced):

10:52:38am up 20:23, 456 worlds, 4 VMs, 16 vCPUs; CPU load average: 0.07, 0.08, 0.07

10:52:01am up 22 days 20:49, 618 worlds, 27 VMs, 84 vCPUs; CPU load average: 0.22, 0.18, 0.20

10:51:36am up 22 days 20:04, 566 worlds, 17 VMs, 65 vCPUs; CPU load average: 0.11, 0.11, 0.11

I could roll back to an earlier BIOS version (and ESXi build) for testing/observation, but first I'm keen to find out what I should be checking in the existing environment to understand this behavior.
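(For anyone who wants to sanity-check the same thing from PowerCLI rather than esxtop, below is a rough sketch of what I'd run against one of these hosts -- cpu.ready.summation is the realtime counter sampled every 20 seconds, the host name is just a placeholder, and the result is the group %RDY summed across vCPUs, so it won't match esxtop exactly.)

# Rough sketch: convert cpu.ready.summation (milliseconds per 20-second realtime
# sample) into a %RDY figure per VM. 'esx01.fqdn' is a placeholder host name.
$vms = Get-VMHost esx01.fqdn | Get-VM | Where-Object { $_.PowerState -eq 'PoweredOn' }

foreach ($vm in $vms) {
    # Instance '' is the aggregate across all of the VM's vCPUs
    $ready = Get-Stat -Entity $vm -Stat cpu.ready.summation -Realtime -MaxSamples 15 |
             Where-Object { $_.Instance -eq '' } |
             Measure-Object -Property Value -Average

    # %RDY = ready time (ms) / sample interval (20,000 ms), summed over all vCPUs
    $pctRdy = [math]::Round(($ready.Average / 20000) * 100, 1)
    "{0,-30} {1,6}%" -f $vm.Name, $pctRdy
}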

Cheers,

Jon

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77
MKguy
Virtuoso

Two things to check first:

1. Can you check if the VMs are balanced across multiple NUMA nodes? Add the NUMA counters in the (r)esxtop memory view to check this.

Maybe the ESXi CPU scheduler has put all or most of them on a single NUMA node, so the vCPUs effectively run on only one CPU and its 16 cores. This used to be an issue on older 4.x/5.x builds, but it should have been resolved since around 5.0 U1, iirc.

2. Disable the CPU power-saving options in the server BIOS and/or set the ESXi power policy to High Performance.
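If you want to do the ESXi side of that without clicking through the client, a rough PowerCLI sketch is below (the host name is a placeholder, and as far as I remember policy key 1 maps to High Performance on 5.x, so verify it in the client afterwards):

# Sketch: switch the ESXi host power policy to High Performance
# (key 1 should be High Performance on 5.x - please verify, this is from memory)
$vmhost   = Get-VMHost esx01.fqdn
$powerSys = Get-View $vmhost.ExtensionData.ConfigManager.PowerSystem

# List the policies the host knows about, then apply key 1
$powerSys.Capability.AvailablePolicy | Select-Object Key, ShortName
$powerSys.ConfigurePowerPolicy(1)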

-- http://alpacapowered.wordpress.com
jrmunday
Commander

Hi MKguy,

I did notice a NUMA imbalance yesterday (some VMs only 34% local), despite the fact that the largest VM fits within a single NUMA node.

I rebooted the host as a precautionary measure, and it made some difference. Most are now 99% local, but there are still a few that are much lower (this seems to change over time).

Examples:

11:21:50am up 20:52, 456 worlds, 4 VMs, 16 vCPUs; MEM overcommit avg: 0.00, 0.00, 0.00
PMEM  /MB: 262118   total:  2736     vmk, 36851 other, 222529 free
VMKMEM/MB: 261732 managed:  3231 minfree,  9850 rsvd, 251882 ursvd,  high state
NUMA  /MB: 32756 (21351), 32768 (21404), 32768 (33060), 32768 (33081), 32768 (32534), 32768 (32465), 32768 (32445), 32752 (15696)
PSHARE/MB:   553  shared,    92  common:   461 saving
SWAP  /MB:     0    curr,     0 rclmtgt:                 0.00 r/s,   0.00 w/s
ZIP   /MB:     0  zipped,     0   saved
MEMCTL/MB:     0    curr,     0  target, 15729 max


11:18:50am up 22 days 21:16, 624 worlds, 27 VMs, 84 vCPUs; MEM overcommit avg: 0.00, 0.00, 0.00
PMEM  /MB: 262118   total:  2931     vmk,148544 other, 110641 free
VMKMEM/MB: 261732 managed:  3231 minfree, 25341 rsvd, 236390 ursvd,  high state
NUMA  /MB: 32756 ( 4802), 32768 (22163), 32768 ( 6449), 32768 (10208), 32768 (12513), 32768 (14157), 32768 (24945), 32752 ( 4207)
PSHARE/MB: 74604  shared, 15864  common: 58740 saving
SWAP  /MB:     0    curr,     0 rclmtgt:                 0.00 r/s,   0.00 w/s
ZIP   /MB:     0  zipped,     0   saved
MEMCTL/MB:     0    curr,     0  target, 125042 max


11:22:18am up 22 days 20:35, 566 worlds, 17 VMs, 65 vCPUs; MEM overcommit avg: 0.00, 0.00, 0.00
PMEM  /MB: 262118   total:  2855     vmk,110454 other, 148807 free
VMKMEM/MB: 261732 managed:  3231 minfree, 21907 rsvd, 239824 ursvd,  high state
NUMA  /MB: 32756 (18580), 32768 (14367), 32768 ( 1246), 32768 (14775), 32768 (15073), 32768 (32391), 32768 (23475), 32752 (22064)
PSHARE/MB: 27919  shared,  5715  common: 22204 saving
SWAP  /MB:     0    curr,     0 rclmtgt:                 0.00 r/s,   0.00 w/s
ZIP   /MB:     0  zipped,     0   saved
MEMCTL/MB:     0    curr,     0  target, 67763 max

I put together a PowerShell script to collect some allocation information (see below). Interestingly, the four 16-core procs show up as 8 NUMA nodes, so each must be two 8-core dies stuck together to make a 16-core package; I wonder if this is affecting it.

(Attachment: Hosts.png)
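To double-check the cores-per-node layout behind that screenshot, a quick sketch like this (the same esxcli calls my script uses; the host name is a placeholder) groups the physical CPUs by NUMA node -- on these boxes I'd expect 8 nodes of 8 cores each:

# Sketch: show the NUMA node count and how many logical CPUs sit in each node
# (equal to cores here, since these Opterons have no hyper-threading)
$esxcli = Get-EsxCli -VMHost esx01.fqdn
$esxcli.hardware.memory.get() | Select-Object NUMANodeCount
$esxcli.hardware.cpu.list()  | Group-Object Node | Select-Object Name, Count | Sort-Object Name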

I don't see the same behavior on my HP servers, where I can influence %RDY in a controlled set of tests (as expected) ... it's just these Dell R815s that seem to be problematic.

With regard to BIOS settings, I have them all set to max performance and C1E disabled.
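And for completeness, this is roughly how I'd pull what ESXi itself reports as the active power policy across the hosts (just a sketch; the property should be there on 5.x):

# Sketch: report the active ESXi power policy per host
Get-Datacenter DC* | Get-VMHost | Sort-Object Name | Select-Object Name,
    @{N='PowerPolicy'; E={ $_.ExtensionData.Config.PowerSystemInfo.CurrentPolicy.ShortName }}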

Cheers,

Jon

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77
MKguy
Virtuoso

I did notice a NUMA imbalance yesterday (some VMs only 34% local), despite the fact that the largest VM fits within a single NUMA node.

I rebooted the host as a precautionary measure, and it made some difference. Most are now 99% local, but there are still a few that are much lower (this seems to change over time).

Memory locality is less of a concern with regard to CPU %RDY. What matters is how many vCPUs are scheduled on the same NUMA home node (NHN in esxtop).

If I interpret your statistics correctly, you have some VMs with 8 vCPUs on these hosts (vCPU-MAX). That means a single VM can occupy a whole 8-core NUMA node by itself, so if other vCPUs are scheduled on the same node (and thus the same physical cores), they start competing for CPU timeslices and %RDY can rise. For example, two 8-vCPU VMs sharing the same home node would put 16 vCPUs onto 8 physical cores. A significant increase usually only occurs if the CPU utilization of those VMs is fairly high, though.

So check the distribution of your VMs across NUMA home nodes first.

And just to make sure: You don't have any CPU limits on these VMs, right?
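A quick way to rule that out across those hosts (just a sketch; the host names are placeholders, and I believe an unlimited VM shows a blank/null CpuLimitMhz here):

# Sketch: list any VMs on the affected hosts that have a CPU limit set
Get-VMHost esx01.fqdn, esx02.fqdn | Get-VM |
    Get-VMResourceConfiguration |
    Where-Object { $_.CpuLimitMhz -gt 0 } |
    Select-Object VM, CpuLimitMhz, CpuReservationMhz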

This statistic seems really useful btw. Have you posted the PowerCLI snippet anywhere or do you mind sharing it?

-- http://alpacapowered.wordpress.com
jrmunday
Commander

Thanks, I'll check the home nodes and let you know what I find.

The script is quite raw and can definitely be improved, as I rushed it just to get the information I needed ... feel free to use/change it as you need:

## Clear the screen
Clear-Host

## Record the start time
$starttime = (Get-Date)

## Create empty arrays
$MyCol = @()
$results_numa = @()
$collect_numa = @()
$results_hwm = @()
$collect = @()

## Connect to vCenter
Connect-VIServer vcenter.fqdn

## Get the hosts that you need
$hsts = Get-Datacenter DC* | Get-VMHost * | Sort-Object Name

## Collect information for each host
Foreach ($hst in $hsts){

    $esxcli = Get-EsxCli -VMHost $hst.Name
    $clst = Get-VMHost $hst.Name | Get-Cluster
    $VMCPUalloc = Get-VMHost $hst.Name | Get-VM | Measure-Object -Property NumCpu -Sum -Average -Maximum -Minimum

    ## Physical memory / NUMA node details, tagged with the host name
    $esxcli_hwm = (
        $esxcli.hardware.memory.get() |
        Add-Member -MemberType NoteProperty -Name HostName -Value $hst.Name -PassThru)

    ## Physical CPU details, tagged with the host name
    $esxcli_hc = (
        $esxcli.hardware.cpu.list() |
        Add-Member -MemberType NoteProperty -Name HostName -Value $hst.Name -PassThru)

    $CPU_CoreSpeed = $esxcli_hc | Measure-Object -Property CoreSpeed -Average
    $CPU_Node = $esxcli_hc | Measure-Object -Property Node -Maximum

    ## Per-host NUMA summary (memory per node in GB)
    $collect_numa = New-Object -TypeName PSObject -Property @{
        VMHost         = $esxcli_hwm.HostName
        PhysicalMemory = [math]::Round(($esxcli_hwm.PhysicalMemory /1024 /1024 /1024),2)
        NUMANodeCount  = $esxcli_hwm.NUMANodeCount
        NUMALocal      = [math]::Round(([long]$esxcli_hwm.PhysicalMemory /1024 /1024 /1024 / [int]$esxcli_hwm.NUMANodeCount),2)
    }

    ## Per-host allocation summary
    $collect = New-Object -TypeName PSObject -Property @{
        Host          = $hst.Name
        Cluster       = $clst.Name
        Manufacturer  = $hst.Manufacturer -replace (" Inc.","")
        Model         = $hst.Model
        'CPU-Type'    = $hst.ProcessorType -replace (" Processor "," ")
        pCPU          = $hst.NumCpu
        TotalVM       = $VMCPUalloc.Count
        vCPU          = "{0:D0}" -f [int]$VMCPUalloc.Sum
        'vCPU-RATIO'  = "{0:P0}" -f ([int]$VMCPUalloc.Sum / [int]$hst.NumCpu)
        'vCPU-MAX'    = "{0:D0}" -f [int]$VMCPUalloc.Maximum
        'vCPU-MIN'    = "{0:D0}" -f [int]$VMCPUalloc.Minimum
        'vCPU-AVG'    = $VMCPUalloc.Average
        pRAM          = "{0:D0}" -f [int]$collect_numa.PhysicalMemory
        'NUMA-NODES'  = $collect_numa.NUMANodeCount
        'NUMA-MEM'    = "{0:D0}" -f [int]$collect_numa.NUMALocal
        'NUMA-CPU'    = "{0:D0}" -f [int]([int]$hst.NumCpu / [int]$collect_numa.NUMANodeCount)
        'CPU-GHz'     = [math]::Round(([long]$CPU_CoreSpeed.Average / 1000000000), 1)
        'CPU-Node'    = ($CPU_Node.Maximum + 1)
    }

    $MyCol += $collect
}

## Output results to a grid view (hyphenated property names need quoting)
$MyCol | Select-Object Cluster, Host, Manufacturer, Model, 'CPU-Type', 'CPU-GHz', 'CPU-Node', pCPU, 'vCPU-RATIO', TotalVM, vCPU, 'vCPU-MIN', 'vCPU-AVG', 'vCPU-MAX', pRAM, 'NUMA-NODES', 'NUMA-MEM', 'NUMA-CPU' | Sort-Object Cluster, Host | Out-GridView #Format-Table -AutoSize

## Disconnect from vCenter
Disconnect-VIServer -Server * -Force -Confirm:$false

## Calculate the execution time
$endtime = (Get-Date)
$timeConvert = New-TimeSpan -Seconds $(($endtime - $starttime).TotalSeconds)
$elapsedTime = '{0:00}m:{1:00}s' -f $timeConvert.Minutes, $timeConvert.Seconds
Write-Host "Script took $elapsedTime to complete"

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77