We purchased a new Dell R760 with 7 onboard NVMe SSDs set up in RAID 5 on a PERC H965i.
The server is set up with the custom Dell ISO of ESXi 8.0.1 build-21813344.
I was setting up a new Windows server to act as a proxy for Veeam and noticed the server hung for around 15 minutes (which might not be related to the issue I am posting about). So I used esxtop to see if something was going on. Everything was fine except for the storage area. There was barely any disk activity from the Windows server, but the latency numbers are a real head scratcher.
This is one sample, and it happens randomly with just 1 VM set up.
KAVG/cmd: -136383.89 (Yes that is negative)
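For anyone who wants to check their own hosts for the same thing, here is a minimal sketch that scans an esxtop batch capture (for example from `esxtop -b`) for negative or implausibly large adapter latencies. The column-name pattern and the 1000 ms cutoff are my assumptions, not something from Dell or VMware, so check the pattern against the header row of your own capture first. `find_anomalies` is just a name I made up for this example.

```python
import csv
import io
import re

# ASSUMPTION: adapter latency columns in the batch CSV look like
#   \\host\Physical Disk Adapter(vmhbaN)\Average Kernel MilliSec/Command
# Verify this against the header row of your own capture.
LATENCY_COL = re.compile(
    r"\\Physical Disk Adapter\((?P<adapter>[^)]+)\)"
    r"\\Average (?P<kind>Driver|Kernel|Guest|Queue) MilliSec/Command$"
)


def find_anomalies(csv_text, max_ms=1000.0):
    """Return (sample_index, adapter, kind, value) for suspicious latencies.

    A value is flagged when it is negative or larger than max_ms.
    max_ms is an arbitrary illustrative cutoff, not a vendor threshold.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    # Remember which columns hold adapter latency counters.
    columns = []
    for index, name in enumerate(header):
        match = LATENCY_COL.search(name)
        if match:
            columns.append((index, match.group("adapter"), match.group("kind")))

    anomalies = []
    for sample_index, row in enumerate(reader):
        for index, adapter, kind in columns:
            try:
                value = float(row[index])
            except (ValueError, IndexError):
                continue  # skip blank or truncated cells
            if value < 0 or value > max_ms:
                anomalies.append((sample_index, adapter, kind, value))
    return anomalies
```

Running it over a capture taken while the host is nearly idle should show whether the odd values are isolated samples or appear on every adapter.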
I downloaded Iometer to load test the storage, and the numbers showed what I would expect.
I have opened a ticket with VMware but wanted to ask the community if anyone has seen anything like this.
I was also on the phone with Dell ProSupport for about 3 hours, and since they could not find anything they wanted me to call VMware.
All drivers and firmware on the storage are up to date.
I have not put this server into production and will not until there is an answer. My fear is that I will move servers over and then run into an issue.
Found out some interesting things!
The same problem with high latency on large block sizes also exists on the H755N with firmware 52.21.1-5149, released on 22.09.23.
On firmware 52.21.0-4606 everything works great! Maybe you could try downgrading the H965i firmware to 188.8.131.52.18-86 too? Maybe Dell implemented the same fixes for PERC11 and PERC12 in the latest builds, so the same problem occurs on the latest firmware for both controllers.
Thanks. I downgraded and it still does the same thing with latency.
I did get this from VMware on my specific issue.
The Engineering team have shared the update that:
“We have been actively debugging this issue, but looks to be a tricky one. We have added debug logs from where the stats are fetched, but we see no anomalies there, yet esxtop reports high and negative stats sometimes. We have not yet root caused the issue, debug is still in progress.”
I will keep you updated with the progress.
Can you confirm this issue occurs with RAID 1 or RAID 10 and not just RAID 5? Everyone I've seen reporting this issue so far was using a RAID 5 configuration. I'm looking at buying an R760 and am now considering the PERC 11 version for the time being, though I'm not sure how easy it would be to upgrade in the future.
In my situation it happens on both the BOSS card in RAID 1 and the NVMe RAID 5.
My personal thought is that the NVMe part of the setup is causing the issue, but I cannot confirm this until Dell replaces my configuration with a SAS SSD setup or something else.
Dell is currently working on something for it. I will let you know what I find out.
I am also seeing this latency issue. We just implemented 4 R760s running VMware ESXi 8.0.1, build 22088125, Dell custom ISO A04, with vSAN 8 ESA. We use BOSS in RAID 1 for the OS and have 6 NVMe drives at 3.49 TB each per host. I see high latency on the storage path. Disk latency looks to be very low, as others in this thread have reported. The HBA is reporting very high latency, in the hundreds of thousands and as high as 500,000. Looks like I will be transferring 21 TB back to the old system.
Storage path from one of the R760s. All hosts have high numbers.
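Since the pattern here is low device latency next to huge adapter numbers, a quick sanity check on each sample can help separate real queuing from a reporting glitch. This is only a sketch: it relies on the commonly documented esxtop relation GAVG = DAVG + KAVG, and the 2 ms kernel-latency limit is a rule of thumb I am assuming, not a value from this thread. `check_sample` and its parameters are names I made up.

```python
def check_sample(davg_ms, kavg_ms, gavg_ms, tolerance_ms=0.5, kavg_limit_ms=2.0):
    """Return a list of reasons why one latency sample looks suspicious.

    davg_ms: device latency, kavg_ms: kernel latency, gavg_ms: guest latency.
    tolerance_ms and kavg_limit_ms are illustrative assumptions.
    """
    problems = []

    # Latency is a duration, so a negative value cannot be a real measurement.
    if min(davg_ms, kavg_ms, gavg_ms) < 0:
        problems.append("negative latency")

    # Guest latency should be roughly device latency plus kernel latency.
    if abs(gavg_ms - (davg_ms + kavg_ms)) > tolerance_ms:
        problems.append("GAVG does not match DAVG + KAVG")

    # Kernel latency is normally close to zero on a healthy, idle host.
    if kavg_ms > kavg_limit_ms:
        problems.append("kernel latency above limit")

    return problems
```

For example, `check_sample(0.2, 0.01, 0.21)` returns an empty list, while a sample with a negative KAVG is reported as impossible rather than merely slow.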
After I posted this it occurred to me that we are running vSAN, so we wouldn't need a RAID controller. I went into the inventory, and controller-wise all we have is:
405-AACD : No Controller
403-BCRU : BOSS-N1 controller card + with 2 M.2 480GB (RAID 1) for the OS.