cnidus
Contributor

[Script / Code sample Request] Get Total Network throughput for given set of ESXi hosts or pNICs

Hi all,

I posed this question to Luc Dekens (@LucD22) via twitter earlier and he suggested posting it up here... good suggestion.

Scenario:

We have a set of ESX hosts running on blade hardware, currently using dual pass-through 10G links out to top-of-rack Juniper EX4500 switches. For our next design I'm proposing in-chassis Force10 MXL switches, uplinked to a distributed spine (either LAGged 10G or 40G). There's been some resistance from the network team, who claim we need full speed to all the blades, all the time. I have a strong suspicion our overall throughput for a given chassis is far less than 320 Gbit, but I'd like to prove that with some solid data.

We've got monitoring, but the tools all use 5-minute averages (at best), so transient peaks tend to be missed (and I'd like to capture them if possible).

What I'd like to do:

Generate a set of data that I can use to create a stacked line graph (Each Host OR pNIC would be a line, with total of all of them at each datapoint too). Probably should separate Inbound and Outbound traffic...

I'd like the data to be as granular as possible, so capturing the 20-second averages and calculating a maximum of those averages over a period would be ideal for each host. I'm thinking of, say, a datapoint every 30 minutes, over a period of X (likely the length of time the script is allowed to run for?), maybe with a failsafe (X period, configurable?).

However, for the total, it should probably sum each collection point, then calc the max of those for each datapoint (so as to not capture two peaks that don't occur simultaneously, skewing the data.)
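
To illustrate why the order of the calculation matters, here is a small sketch with made-up numbers: two hypothetical hosts, four samples each, in arbitrary units. None of these values come from real data.

```powershell
# Hypothetical per-interval throughput for two hosts (made-up values).
$hostA = 2, 9, 3, 2
$hostB = 3, 2, 8, 2

# Sum of the per-host maxima: adds peaks that occurred at different times.
$maxA = ($hostA | Measure-Object -Maximum).Maximum
$maxB = ($hostB | Measure-Object -Maximum).Maximum
$sumOfMax = $maxA + $maxB

# Max of the per-interval sums: the highest load seen at any single point in time.
$totals = for ($i = 0; $i -lt $hostA.Count; $i++) { $hostA[$i] + $hostB[$i] }
$maxOfSum = ($totals | Measure-Object -Maximum).Maximum

"Sum of maxima: $sumOfMax"   # 17, overstates the combined peak
"Max of sums:   $maxOfSum"   # 11, the real simultaneous peak
```

The two hosts peak in different intervals, so adding their individual maxima reports a combined load that never actually occurred.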

It may be worth splitting to a new CSV file every day / some-other-small-period, so long running

Other ways of doing this:

Crank up the stats levels to get the "maximum" values in vCenter at the larger intervals. Ideally not: it puts a reasonable amount of strain on the vCenter DB, plus it will produce skewed data when combining Max values.

Thoughts?
Youd, Douglas

Douglas Youd Senior Virtualization Engineer zettagrid
LucD
Leadership

The way that the vSphere PerformanceManager captures and aggregates the different counters is key in this.

The 20 second interval is only available in the so-called Realtime interval.

The drawback is that the data is kept for only about 1 hour on the ESXi node itself.

See my PowerCLI & vSphere statistics – Part 1 – The basics post for some info on intervals.

What I would propose, and what you were apparently considering as well, is to capture the Realtime data every 30 minutes (this can be done through a Scheduled Task on a Windows box that has PowerCLI installed).

You could consider storing the data in a SQL Express DB instead of separate CSVs.

With another script you could then read the data on a regular interval and produce the reports and graphs.

I wouldn't advocate a change in the Statistics Level to capture the Maximum counter.

It will make your vCenter database bigger and most probably slower.

Where are you stuck right now; on the counters to select or how to do the Get-Stat call ?


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference

cnidus
Contributor

Hi Luc,

Thanks for the prompt reply.

Using a SQL DB is not a bad idea, but maybe a little overkill for my needs, although if it were running longer it might be worthwhile. Realistically, I'm probably going to run it for a week, so that's only 7x24x2 records to store in total; a CSV should do, I guess.
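
As a rough sketch of the CSV approach, the helper below appends each run's records to one file per day. The function name `Add-ToDailyCsv`, the `netstats-` file prefix and the folder layout are hypothetical, chosen only for illustration.

```powershell
# Hypothetical helper: append records to one CSV file per calendar day.
function Add-ToDailyCsv {
    param(
        [Parameter(Mandatory)] [object[]] $Record,
        [Parameter(Mandatory)] [string] $Folder,
        [datetime] $Date = (Get-Date)
    )
    # Assumed naming scheme, e.g. netstats-2013-05-01.csv
    $path = Join-Path $Folder ("netstats-{0:yyyy-MM-dd}.csv" -f $Date)
    # -Append (PowerShell 3.0 and later) adds rows to an existing file
    # and creates the file, including the header row, if it does not exist.
    $Record | Export-Csv -Path $path -NoTypeInformation -Append
    # Return the path so the caller can log where the data went.
    $path
}
```

A scheduled run could then call something like `Add-ToDailyCsv -Record $rows -Folder C:\Stats`. Note that `-Append` expects every record to have the same columns as the existing file, so hosts with a different set of vmnics would need padding or a separate file.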

If I'm understanding correctly, you think doing a point-in-time calc based on the last 30 minutes of data is the way to go for the data-collect.ps1 script, and then just scheduling that to run?

I really wanted to get an opinion on whether what I'm trying to do is a good way of doing it (which it sounds like it is), and maybe some sample code. I think I can figure out which counters to use; if you have a sample Get-Stat command handy, that may be helpful. The logic I should be able to figure out....

So, I'll have a crack.

Douglas Youd Senior Virtualization Engineer zettagrid

LucD
Leadership

As always, "it depends" what you are trying to show/prove/learn.

Let's assume you have 30 minutes of data that is collected on 20 second intervals.

If you just calculate the maximum value, then that could be representing different situations:

For example, have a look at these 2 extremes

1) all your values are 10, and you have 1 maximum of 20 in 90 data points

2) all your 90 data points have a value of 20

To discover that these are entirely different situations, you could also include the average. And perhaps even the standard deviation.

So your 90 values become 3 values. Nice reduction of data, but at the cost of losing detail.
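
A minimal sketch of that reduction, using synthetic series shaped like the two extremes above. The helper name `Get-SeriesSummary` is hypothetical, and the standard deviation is computed by hand (as a population standard deviation) because older PowerShell versions have no built-in switch for it.

```powershell
# Hypothetical helper: reduce a series of samples to maximum, average
# and population standard deviation.
function Get-SeriesSummary {
    param([double[]] $Values)
    $m = $Values | Measure-Object -Average -Maximum
    # Sum of squared distances from the mean.
    $sumSq = 0.0
    foreach ($v in $Values) { $sumSq += [math]::Pow($v - $m.Average, 2) }
    [pscustomobject]@{
        Maximum = $m.Maximum
        Average = [math]::Round($m.Average, 2)
        StdDev  = [math]::Round([math]::Sqrt($sumSq / $Values.Count), 2)
    }
}

$n = (30 * 60) / 20               # number of 20-second samples in 30 minutes
$spiky = @(10) * ($n - 1) + 20    # quiet, with a single short peak
$flat  = @(20) * $n               # constantly busy

Get-SeriesSummary -Values $spiky
Get-SeriesSummary -Values $flat
```

Both series report the same Maximum of 20, but the average and standard deviation tell them apart: the flat series has a standard deviation of 0, while the spiky one has an average close to 10.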

Ultimately, you could decide to store all data points, and for example plot them in a curve.

That will give the best representation of what actually happened.

But don't forget that the value for the 20 seconds interval is already an aggregate or an average.

And 20 seconds is a lifetime in the IT world.

Every sampling, by definition, throws away granularity.

I know, this was not really an answer on what you should do.

But perhaps it gave you enough food for thought to consider what would be the best solution in your case and with your requirements 😉


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference


cnidus
Contributor

It depends..... always.

I think for my purposes, for each 30min dataset I'll calculate the following:
1) Totals: ForEach 20sec interval: Sum all Nic0 TX, Sum all Nic1 TX, Sum all Nic0 RX, Sum all Nic1 RX
2) The max and average of the Totals dataset, for Nic0/1 TX/RX

3) Each Host: MaxNic0TX, MaxNic0RX, AvgNic0TX, AvgNic0RX, MaxNic1TX, MaxNic1RX, AvgNic1TX, AvgNic1RX,

For what I'm trying to achieve (justify, or not) some level of uplink overallocation at the chassis boundaries.... that should suffice.

End of the day, all that traffic has to be going somewhere and there are only a few upstream places it could be going; the rest will be inter-blade, which I'm less concerned about. So something quick and dirty like this will be good enough to indicate if I'm in the right ballpark or not. I'd be really surprised to see some 100+ Gbit peaks, but stranger things have happened.

Anywho, I've hacked together a few samples I've worked with in the past, and one of yours (Get the maximum IOPS | LucD notes)... but now I'm a bit lost in working with the PSObject. Any chance you could point me in the right direction? Perhaps there's a cleaner way to do that.

Douglas Youd Senior Virtualization Engineer zettagrid

LucD
Leadership

Sure.

Couple of questions:

- each host has 2 NICs I gather ?

- How should the result look ? A table somewhat like this ?

CollectionTime  Host  NIC0AvgRx  NIC1AvgRx  NIC0AvgTx  NIC1AvgTx

Where CollectionTime would be the start of the 30 minute interval ?


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference


cnidus
Contributor

Yep. Dual 10g NICs.

Table just like that, with the Max values too. I was thinking the total could just be an instance of Host too, so it's a single flat table which is easy to manipulate later.

CollectionTime  Host  NIC0AvgRx  NIC1AvgRx  NIC0AvgTx  NIC1AvgTx  NIC0MaxRx  NIC1MaxRx  NIC0MaxTx  NIC1MaxTx
Douglas Youd Senior Virtualization Engineer zettagrid

LucD
Leadership

Try it like this

$esx = Get-VMHost
$stat = "net.received.average","net.transmitted.average"
$start = (Get-Date).AddMinutes(-30)

# Fetch the last 30 minutes of 20-second realtime samples, one group per host
Get-Stat -Entity $esx -Stat $stat -Start $start -Realtime |
Group-Object -Property {$_.Entity.Name} | %{
  # One output record per host
  $record = New-Object PSObject -Property @{
    CollectionTime = $_.Group[0].Timestamp
    Host = $_.Name
  }
  # Keep only the physical NIC instances and summarise each vmnic separately
  $_.Group | where {$_.Instance -match "vmnic"} |
  Group-Object -Property Instance | %{
    $RxValue = $_.Group | where {$_.MetricId -eq "net.received.average"} |
      Measure-Object -Property Value -Average -Maximum
    $TxValue = $_.Group | where {$_.MetricId -eq "net.transmitted.average"} |
      Measure-Object -Property Value -Average -Maximum
    # Add one column per vmnic and statistic, e.g. "vmnic0AvgRx (KBps)"
    Add-Member -InputObject $record -Name ($_.Name + "AvgRx (KBps)") -Value ([math]::Round($RxValue.Average,1)) -MemberType NoteProperty
    Add-Member -InputObject $record -Name ($_.Name + "AvgTx (KBps)") -Value ([math]::Round($TxValue.Average,1)) -MemberType NoteProperty
    Add-Member -InputObject $record -Name ($_.Name + "MaxRx (KBps)") -Value ([math]::Round($RxValue.Maximum,1)) -MemberType NoteProperty
    # MaxTx must come from the transmit measurements, not the receive ones
    Add-Member -InputObject $record -Name ($_.Name + "MaxTx (KBps)") -Value ([math]::Round($TxValue.Maximum,1)) -MemberType NoteProperty
  }
  $record
}
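
Since the script reports values in KBps while the question is framed in Gbit, a small conversion helper may be useful when comparing against link capacity. This is only a sketch: the function name `ConvertTo-Gbps` is hypothetical, and it assumes 1 KB = 1000 bytes (pass 1024 if you prefer the binary unit).

```powershell
# Hypothetical helper: convert a network rate in KBps to Gbit/s.
# Assumption: 1 KB = 1000 bytes unless -BytesPerKB says otherwise.
function ConvertTo-Gbps {
    param(
        [double] $KBps,
        [int] $BytesPerKB = 1000
    )
    # KBps -> bytes/s -> bits/s -> Gbit/s
    ($KBps * $BytesPerKB * 8) / 1e9
}

ConvertTo-Gbps -KBps 1250000   # 1,250,000 KBps is 10 Gbit/s, i.e. a saturated 10G link
```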


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference

cnidus
Contributor

Something appears to be off in the math. Max and averages appear to be about half what I can see in vCenter for the same period.

Any chance you can briefly describe the logic / loops above? I'll have a go at troubleshooting the discrepancy.

Plus, it doesn't appear to store or calculate any aggregated peaks across multiple hosts. But I can work that out from this point.
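
One possible way to get the aggregated peak is sketched below. The helper `Get-TotalPeak` is hypothetical; it only relies on each sample having `Timestamp`, `MetricId` and `Value` properties, which Get-Stat output has. It also assumes that all hosts report on the same 20-second boundaries. If they do not, the timestamps would need rounding to a common bucket first.

```powershell
# Hypothetical helper: maximum and average of the per-timestamp totals,
# calculated separately for each metric.
function Get-TotalPeak {
    param([Parameter(Mandatory)] [object[]] $Sample)
    $Sample | Group-Object -Property MetricId | ForEach-Object {
        $metric = $_.Name
        # First sum all hosts/NICs for each 20-second timestamp...
        $totals = $_.Group | Group-Object -Property Timestamp | ForEach-Object {
            ($_.Group | Measure-Object -Property Value -Sum).Sum
        }
        # ...then take the maximum and average of those sums.
        $m = $totals | Measure-Object -Maximum -Average
        [pscustomobject]@{
            MetricId = $metric
            MaxTotal = $m.Maximum
            AvgTotal = [math]::Round($m.Average, 1)
        }
    }
}
```

With live data, the samples should be filtered to the vmnic instances before the call, so that per-host aggregate instances are not counted twice, for example `Get-TotalPeak -Sample (Get-Stat -Entity $esx -Stat $stat -Start $start -Realtime | where {$_.Instance -match "vmnic"})`.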

Cheers, I'll have a crack at modifying it to suit.

Douglas Youd Senior Virtualization Engineer zettagrid

LucD
Leadership

Strange, I just did some tests, and the numbers I get seem to correspond reasonably.

[Screenshots: g2.png, g1.png]

For the logic behind the script, let's first try with this pic (1 pic -gt 1000 words :D)

If there are specific points that are not clear, feel free to ask.

[Diagram: g3.png]


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference


cnidus
Contributor

Ok.

I think I've got something close to reasonable now.

I was looking more at the Max values. You had a typo and were referencing average for the max calcs. Good to know even the pros make mistakes :)

Thanks for your help @lucd . Much appreciated. 

Douglas Youd Senior Virtualization Engineer zettagrid

LucD
Leadership

You're welcome, and as long as there is no "code spell checker" I will keep making typos :D


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference
