Wondering if someone else has run across this one?
Got a sizeable environment of about 80 hosts that we recently updated from some build of Update 1 (not sure which) to Update 2 (Build 13981272 to be exact). No problems early on, but as we upgraded more hosts, we started getting storage alarms about High I/O Latency to VMs. After lots of digging, it looks like my path policy was reset from Round Robin to Most Recently Used.
I use the following command in my deployments to set the default path policy for my ALUA disks:
esxcli storage nmp satp set -s VMW_SATP_ALUA -P VMW_PSP_RR
It's just like that config disappeared in my upgrade. Really just wondering if other folks are seeing this as well, or if I have something else in my environment I should track down. This is the only change I'm aware of that has been made to the hosts. But maybe it will help someone else out there if they're seeing it too...
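In case it helps anyone compare: this is how I sanity-check the defaults after an upgrade. The grep filter is just my habit, and the output columns may differ slightly by build:

```shell
# List every SATP with the PSP it currently defaults to; VMW_SATP_ALUA
# should show VMW_PSP_RR if the setting survived the upgrade.
esxcli storage nmp satp list | grep -i alua
```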
Message was edited by: Schroeber - Changed title of thread to indicate problem is with 3PAR Disks
Think I figured out my issue...
Looks like 6.7 Update 2 introduced a new SATP claim rule for 3PAR disks (which is what I'm using) that sets the path policy to MRU.
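You can spot the rule yourself by listing the SATP claim rules and filtering on the 3PAR vendor string (a quick check, assuming standard esxcli output):

```shell
# Show all SATP claim rules and keep only the ones matching vendor "3PARdata"
esxcli storage nmp satp rule list | grep -i 3PARdata
```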
Why they set the path policy to MRU, I have no idea. In previous versions, my systems used the generic VMW_SATP_ALUA rule, which I would modify with the previously mentioned command. But now that a more-specific rule applies, my disks are following that instead. And with that, my disk I/O is sucking...
I've tried everything I can to modify that rule, but no luck so far. I just get errors basically saying I can't modify a default claim rule. So I now have an open case with support to see what I need to do. Once I hear back I'll update this thread. Hope I hear from support soon...
I had the same issues on earlier ESXi versions. Do your volumes have the tuning advised by HPE (Queue Full Threshold 4, Queue Full Sample Size 32)?
(Sometimes "queue-depth 128" is advised on top of that, but it's best not to change it if you don't need to...)
You can check it with the following command:
esxcli storage core device list
It will give you the list of all your volumes with their associated parameters; then you can take a look at the Queue Full Sample Size, Queue Full Threshold, and Device Max Queue Depth if needed.
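To avoid scrolling through the full output, you can filter for just the queue-related lines (the field names below are what "device list" prints on my hosts; they may vary by build):

```shell
# Show each device's display name plus its queue tuning values
esxcli storage core device list | grep -E 'Display Name|Queue Full|Max Queue Depth'
```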
If the values are different from those advised by HPE, here's the command to set them for one volume (test with only one or two volumes first, then if it works, apply it to all):
esxcli storage core device set --device device_name --queue-full-threshold 4 --queue-full-sample-size 32
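If it helps, here's a rough loop for rolling it out to several volumes at once. The device names are placeholders, and DRYRUN=1 just echoes the commands so you can review them before actually running anything:

```shell
# Placeholder NAA IDs; replace with your own volumes from "device list"
DEVICES="naa.60002ac0000000000000000000000001 naa.60002ac0000000000000000000000002"
DRYRUN=1

# Echo the command instead of running it when DRYRUN=1
run() { if [ "$DRYRUN" = "1" ]; then echo "$@"; else "$@"; fi; }

for dev in $DEVICES; do
  run esxcli storage core device set --device "$dev" \
      --queue-full-threshold 4 --queue-full-sample-size 32
done
```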
Finally, have you tried setting the PSP options of your current rule to "iops=1"?
The default setting is 1000, which means the device switches to the next path after 1000 I/Os; HPE/3PAR advise setting it to 1.
esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -P "VMW_PSP_RR" -O "iops=1" -c "tpgs_on" -V "3PARdata" -M "VV" -e "HP 3PAR Custom Rule"
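After adding the rule (devices usually need a reboot or an unclaim/reclaim before the new rule takes effect), you can confirm the volumes actually picked up Round Robin. On my hosts the iops value shows up in the "Path Selection Policy Device Config" line, though the output may vary by build:

```shell
# List NMP devices and show which path selection policy each one is using
esxcli storage nmp device list | grep -E 'Device Display Name|Path Selection Policy'
```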
Let me know if it improves the situation
Source (page 37/172 of the HPE 3PAR StoreServ VMware ESX/ESXi Implementation Guide): https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=2ahUKEwjAzd_xnvDjAhVr5eAKHdRxBQ8...