VMware Cloud Community
stromcooper
Contributor

Storage Resource Management via PowerCLI - BCO2649

Our company has recently re-adopted a storage tiering solution, but this time with a focus on IO density.  It promises to be quite nice and to save us a lot of money; however, the execs cut the monitoring tools (Watch4Net, vcOps) from the proposal while keeping the savings on disk purchases.  Isn't that always the way?  ;-)


BACKGROUND

Our storage engineers are already configuring the arrays with this new layout.  We will have Bronze, Silver, Gold, and Platinum tiers, allowing 0.03, 0.4, 3.2, and 20 IOps per GB of provisioned disk, respectively.  We are coming from an environment where provisioning was done strictly on capacity, so it really is the Wild West in terms of the performance of our systems.  We are also a large healthcare provider, with almost 700 hosts spread across four datacenters, so there is no lack of scale to contend with in monitoring this either.  Our goal, as virtual infrastructure admins, is to accommodate this large cost-savings play without killing our operations staff.  As you can imagine, having no tool and no prior performance data will make this a challenge.

The approach has storage presenting LUNs to us from a given tier, and we will name the datastores for the tier they are in.  Datastore Clusters or Storage Profiles would be a good play for some of our non-Prod workloads; however, much of our Prod workload leverages SRM array-based replication, so which specific LUN a VMDK sits on matters.  The architects also assumed we could easily move each VMDK onto the performance-appropriate tier of storage, further compounding the complexity but saving capital dollars.  Our first concern is: "What happens when a mission-critical app complains of performance on a single VMDK?"  With replication configured per LUN, the only performance relief we know of would be to svMotion the VMDK to a higher tier.  With >15,000 VMDKs in our primary datacenter alone, this would be a nightmare for our operations staff.  As storage is configured for a specific number of IOps per LUN, and with svMotions drawing from those array-limited IOps while the LUN simultaneously services Prod workloads, this was untenable.  Our solution was to advertise VMDK IO limits at one level, yet configure the storage frames with a slightly higher limit so that svMotions would not impact Prod.  This also allows our operations staff to increase the IO limit on a VMDK temporarily and batch the svMotions into a maintenance window.  It should work, but the next question is: how do we monitor the VMDKs to know whether they are in an appropriate tier or need to be pro/demoted?
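For reference, temporarily raising the limit on a single VMDK ahead of a batched svMotion can itself be scripted.  A minimal sketch using Set-VMResourceConfiguration (the VM name, disk name, and the 2000 IOps value are placeholders, not our actual tiers):

```powershell
### Raise the IO limit on one VMDK before a scheduled svMotion; set it back afterwards
$vm  = Get-VM -Name 'VM1'                       # placeholder VM name
$hdd = Get-HardDisk -VM $vm -Name 'Hard disk 1' # placeholder disk name
Get-VMResourceConfiguration -VM $vm |
    Set-VMResourceConfiguration -Disk $hdd -DiskLimitIOPerSecond 2000
```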

We need a way to monitor each VMDK's IO limit and how frequently it pushes the high and low sides of it.  To facilitate this, I wrote the following script.  It executes hourly as a scheduled task, pulls the read and write IOps for each VMDK, records the count and percentage of intervals >=90% and <=5% of the IO limit, then writes those values to a CSV file (re-used by subsequent executions).  With 15,000 VMDKs in our single largest datacenter, trying to keep multiple intervals of data at that granularity just proves exhausting.  The idea is that this report runs hourly across the entire enterprise, can be loaded into Excel, and sorted by the high and low watermarks to find re-tiering candidates, which are then scheduled to move during a maintenance window.

Any assistance or ideas will be greatly appreciated.  I've performed as much due diligence as possible against these forums, blogs and general web searches.  It seems now the best I can do is ask the experts and see if there is a way to build a better mousetrap.

PROBLEMS

  1. Execution time - I am running this against just 400 or so VMs, and it takes 35 minutes to generate the report.  As we have >10,000 VMs in the company, we need to find a way to increase performance.
  2. Scale - Similar to #1, any ideas on how to scale this out to be more time-efficient will be appreciated.  Current thoughts are one "master" script that spawns multiple "workers", possibly one per cluster.  Another idea is to connect directly to the hosts for the data instead of going through vCenter, although connecting to 700 hosts seems daunting.
  3. Combining read and write IOps efficiently - For simplicity and better execution time, I've kept the read and write IOps separate in this revision of the script.  I would love to combine the two, matched on VMDK and timestamp, but that seems expensive in processing cycles while trying to stay under the 1-hour realtime data window.
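For problem 3, one possible sketch of the combine step, assuming $Stats holds the Get-Stat output for both virtualdisk metrics and that the Timestamp property was kept in the select (the script below currently drops it): group on Instance and Timestamp, then sum the read and write values in each group.

```powershell
### Combine read and write IOps per VMDK per sample interval
$combined = foreach ($grp in $Stats | Group-Object -Property {$_.Instance, $_.Timestamp}) {
    $row = "" | select Instance, Timestamp, TotalIOps
    $row.Instance  = $grp.Group[0].Instance
    $row.Timestamp = $grp.Group[0].Timestamp
    $row.TotalIOps = ($grp.Group | Measure-Object -Property Value -Sum).Sum
    $row
}
```

Whether the extra Group-Object pass fits inside the 1-hour window at 15,000 VMDKs would need testing.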

SCRIPT

Connect-VIServer VCServer

### Test to see if our output file exists

$OutFile = "c:\tmp\IOLimit-Rpt.csv"

if (Test-Path $OutFile) {

    $CSV = Import-Csv $OutFile

}

else { $CSV = "" }

    

### Create report of VMs, HDDs, and read & write IOps stats per VMDK, then group like stats together so we can gather metrics

$report = @()

foreach ($VM in Get-VM) {

    $VmHdds = Get-HardDisk -VM $VM | select name, capacitykb, parent, extensiondata

    $Stats = Get-Stat -Entity $VM -Stat "virtualdisk.numberreadaveraged.average","virtualdisk.numberwriteaveraged.average" -Realtime | select entity, instance, metricid, value

    $VmdkStats = $Stats | Group-Object -Property {$_.entity, $_.instance, $_.metricid}     ### Group the 184 data points per metric per device together to determine their stats

    foreach ($VmdkStat in $VmdkStats) {

        $row = "" | select VmName, ScsiID, Type, Max, CapacityGB, CurrentIOLimit, IOLimit90, IOLimit5, HddName

        $row.VmName = $VmdkStat.name.split(" ")[0]

        $row.ScsiID = $($VmdkStat.name.split(" ")[1]).trimstart("scsi")

        $VmHdd = $VmHdds | where {$_.parent.name -eq $row.VmName -and $([string]$_.extensiondata.controllerkey).substring(3,1) +":"+ $([string]$_.extensiondata.unitnumber) -eq $row.ScsiID}     ### Grab the corresponding Vm Hdd object for this Stat

        $row.Type = if($VmdkStat.name.split(" ")[2] -like "virtualdisk.numberreadaveraged.average") { "r" } else { "w" }     ### Shrink the longer string for stat type to a simple "r" or "w"

        $Values = $VmdkStat.group | select value | Measure-Object value -Maximum

        $row.Max = $Values.Maximum

        $row.CapacityGB = $VmHdd.CapacityKB / 1024 / 1024

        $row.CurrentIOLimit = if ($VmHdd.ExtensionData.StorageIOAllocation.limit -eq $null -or $VmHdd.ExtensionData.StorageIOAllocation.limit -eq -1) {[int]($row.CapacityGB * .4)} else {$VmHdd.ExtensionData.StorageIOAllocation.limit}     ### Supplies default value to compare with

        $row.IOLimit90 = $($VmdkStat.Group | where {$_.value -ge ([int](.9 * $row.CurrentIOLimit))}).count     ### How many data points equal or exceed 90% utilization

        if ($row.IOLimit90 -eq $null) {$row.IOLimit90 = 0}

        $row.IOLimit5 = $($VmdkStat.Group | where {$_.value -le ([int](.05 * $row.CurrentIOLimit))}).count     ### How many data points less than or equal 5% utilization

        if ($row.IOLimit5 -eq $null) {$row.IOLimit5 = 0}

        $row.HddName = $VmHdd.Name

        $report += $row

    }

}

### Merge the latest Realtime report into the historic report

$report2 = @()

foreach ($record in $report) {

    $row = "" | select VmName, ScsiID, Type, Max, CapacityGB, CurrentIOLimit, IOLimit90Count, IOLimit90Percent, IOLimit5Count, IOLimit5Percent, HddName, DataPoints

### If the CSV file does not exist already, we just dump this data into the file with no massaging    

    if ($CSV -like "") {

        $row.VmName = $record.VmName; $row.ScsiID = $record.ScsiID; $row.Type = $record.Type; $row.Max = $record.Max; $row.CapacityGB = $record.CapacityGB; $row.CurrentIOLimit = $record.CurrentIOLimit; $row.IOLimit90Count = $record.IOLimit90; $row.IOLimit90Percent = ($record.IOLimit90 / 184); $row.IOLimit5Count = $record.IOLimit5; $row.IOLimit5Percent = ($record.IOLimit5 / 184); $row.HddName = $record.HddName; $row.DataPoints = 184

    }

### The CSV file exists, so we need to find the prior record so the new data can be added

    else {

        $OldRecord = $CSV | where {$_.VmName -like $record.VmName -and $_.ScsiID -like $record.ScsiID -and $_.Type -like $record.Type}

### As our tiering is based on IO Density (CapacityGB x IOps allowed for a given tier), if CapacityGB or the IOLimit change, we archive the old data and start a fresh row

        if ([single]$record.CapacityGB -ne [single]$OldRecord.CapacityGB -or $record.CurrentIOLimit -ne $OldRecord.CurrentIOLimit) {

            ### Write old data to archive file    

            "`"{0}`",`"{1}`",`"{2}`",`"{3}`",`"{4}`",`"{5}`",`"{6}`",`"{7}`",`"{8}`",`"{9}`",`"{10}`",`"{11}`",`"{12}`"" -f $OldRecord.VmName,$OldRecord.ScsiID,$OldRecord.Type,$OldRecord.Max,$OldRecord.CapacityGB,$OldRecord.CurrentIOLimit,$OldRecord.IOLimit90Count,$OldRecord.IOLimit90Percent,$OldRecord.IOLimit5Count,$OldRecord.IOLimit5Percent,$OldRecord.HddName,$OldRecord.DataPoints,(Get-Date) | Add-Content -Path c:\tmp\IOLimit-Archive.csv     ### The 13th field stamps the archived row with the date/time it was retired

                

            ### Writes new stat data into CSV file

            $row.VmName = $record.VmName; $row.ScsiID = $record.ScsiID; $row.Type = $record.Type; $row.CapacityGB = $record.CapacityGB; $row.CurrentIOLimit = $record.CurrentIOLimit; $row.HddName = $record.HddName

            $row.Max = $record.Max

            $row.IOLimit90Count = [int]$record.IOLimit90

            $row.IOLimit90Percent = $row.IOLimit90Count / 184

            $row.IOLimit90Percent = "{0:P2}" -f $row.IOLimit90Percent     ### Puts it in % format versus decimal

            $row.IOLimit5Count = [int]$record.IOLimit5

            $row.IOLimit5Percent = $row.IOLimit5Count / 184

            $row.IOLimit5Percent = "{0:P2}" -f $row.IOLimit5Percent     ### Puts it in % format versus decimal

            $row.DataPoints = 184

        }

### Assuming no change in capacity or IO limit, compares metrics between old and current records

        else {

            $row.VmName = $record.VmName; $row.ScsiID = $record.ScsiID; $row.Type = $record.Type; $row.CapacityGB = $record.CapacityGB; $row.CurrentIOLimit = $record.CurrentIOLimit; $row.HddName = $record.HddName

            $row.Max = if ($OldRecord.Max -gt $record.Max) {$OldRecord.Max} else {$record.Max}

            $row.DataPoints = [int]$OldRecord.DataPoints + 184

            $row.IOLimit90Count = [int]$OldRecord.IOLimit90Count + [int]$record.IOLimit90

            $row.IOLimit90Percent = ([int]$OldRecord.IOLimit90Count + [int]$record.IOLimit90) / $row.DataPoints

            $row.IOLimit90Percent = "{0:P2}" -f $row.IOLimit90Percent     ### Puts it in % format versus decimal

            $row.IOLimit5Count = [int]$OldRecord.IOLimit5Count + [int]$record.IOLimit5

            $row.IOLimit5Percent = ([int]$OldRecord.IOLimit5Count + [int]$record.IOLimit5) / $row.DataPoints

            $row.IOLimit5Percent = "{0:P2}" -f $row.IOLimit5Percent     ### Puts it in % format versus decimal

        }

    }

    $report2 += $row

}

$report2 | Export-Csv $OutFile -NoTypeInformation

Disconnect-VIServer VCServer -Confirm:$false -Force:$true

SAMPLE OUTPUT


Definitions

IOLimit90Count - How many data points are >= 90% of the CurrentIOLimit

IOLimit5Count - How many data points are <= 5% of the CurrentIOLimit


VmName,ScsiID,Type,Max,CapacityGB,CurrentIOLimit,IOLimit90Count,IOLimit90Percent,IOLimit5Count,IOLimit5Percent,HddName,DataPoints
VM1,0:00,w,7,20,6,76,41.30%,0,0.00%,Hard disk 1,184
VM1,0:01,w,0,8,30,0,0.00%,180,97.83%,Hard disk 2,184
VM1,0:00,r,0,20,6,0,0.00%,180,97.83%,Hard disk 1,184
VM1,0:01,r,0,8,30,0,0.00%,180,97.83%,Hard disk 2,184
VM2,0:00,w,158,20,8,3,1.63%,35,19.02%,Hard disk 1,184
VM2,0:01,w,1,8,3,0,0.00%,176,95.65%,Hard disk 2,184
VM2,0:00,r,28,20,8,0,0.00%,179,97.28%,Hard disk 1,184
VM2,0:01,r,0,8,3,0,0.00%,180,97.83%,Hard disk 2,184

3 Replies
LucD
Leadership

At first glance, I would suggest replacing the Get-Stat call you do for each VM with one Get-Stat call for all VMs.

After this Get-Stat, you can split the result with the Group-Object cmdlet on the $_.Entity.Name property.
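A minimal sketch of that first suggestion, assuming the same two virtualdisk metrics as the original script:

```powershell
### One Get-Stat call for the whole inventory, then split the result per VM
$allVMs   = Get-VM
$allStats = Get-Stat -Entity $allVMs -Stat "virtualdisk.numberreadaveraged.average","virtualdisk.numberwriteaveraged.average" -Realtime
$perVM    = $allStats | Group-Object -Property {$_.Entity.Name}
```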

You would probably also win on execution time by doing the Get-Harddisk only once for all the VMs.

If you store the results in a hash table, you can easily look up the correct entry based on the VM name and the Harddisk name.
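A sketch of such a lookup table, keyed on a combined "VM name/hard disk name" string (the names in the lookup example are placeholders):

```powershell
### Build the hash table once for all hard disks in the inventory
$hddTab = @{}
foreach ($hdd in Get-VM | Get-HardDisk) {
    $hddTab[$hdd.Parent.Name + "/" + $hdd.Name] = $hdd
}

### Look up a specific disk later without filtering the whole list via Where-Object
$myHdd = $hddTab["VM1/Hard disk 1"]
```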


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference

stromcooper
Contributor

Thanks for the tips.  I'll play around with the Get-Stat.

As for the hash table on Get-Harddisk, you mention it would be based on VM name and Harddisk name.  I'm unfamiliar with how to create a hash table with a combined key like that, as all the examples I can find are based on one value.  Could you provide a snippet on how to build a hash table using a combined key?

stromcooper
Contributor

I've added a second script for subsequent execution, for when I want to tier-grade a given VMDK's performance.  It has our company's IO-density levels baked in, but the values are easily changed for other environments.  I'm also keeping only the higher of the read or write IOps values in the final report, as that is what we provision on.

$CSV = Import-Csv c:\tmp\IOLimit-RT-XrdcAug.csv

foreach ($CSVrow in $CSV) {

     $IODensity = $CSVrow.Max / $CSVrow.CapacityGB

     if ($IODensity -le .03) {$Tier = "Bronze"}

     elseif ($IODensity -le .4) {$Tier = "Silver"}

     elseif ($IODensity -le 3.2) {$Tier = "Gold"}

     elseif ($IODensity -le 20) {$Tier = "Platinum"}

     else {$Tier = "Diamond"}

     Add-Member -InputObject $CSVRow -MemberType NoteProperty -Name "IODensity" -Value $IODensity -Force

     Add-Member -InputObject $CSVRow -MemberType NoteProperty -Name "Tier" -Value $Tier -Force

}

$CsvGrouped = $CSV | Group-Object -Property {$_.VmName, $_.ScsiID}

$report = @()

foreach ($CsvGroupedRow in $CsvGrouped) {

     $Report += $CsvGroupedRow.Group | Sort-Object {[double]$_.Max} -Descending | Select-Object -First 1     ### Keep whichever row (r or w) has the higher numeric Max; also works if a group has only one row

}

$report | Export-Csv c:\tmp\Xrdc-IOReport-Aug.csv -NoTypeInformation
