We purchased a new Dell R760 with 7 onboard NVMe SSDs that are set up in RAID 5 on a PERC H965i.
The server is set up with the custom Dell ISO of ESXi 8.0.1 build-21813344.
I was setting up a new Windows server to act as a proxy for Veeam and noticed the server hung for around 15 minutes (which might not be related to the issue I am posting about). So I checked with esxtop to see if something was going on. Everything was fine except for storage. The Windows server was generating barely any disk activity, but the latency numbers are a real head scratcher.
This is one sample; it happens at random with only 1 VM set up.
CMD/s: 626.54
READS/s: 622.37
WRITES/s: 4.16
MBREAD/s: 2.57
DAVG/cmd: 136384.03
KAVG/cmd: -136383.89 (Yes that is negative)
GAVG/cmd: .014
QAVG/cmd: 142262.59
I downloaded Iometer to load test the storage, and the numbers were what I would expect.
CMD/s: 225445.98
READS/s: 113026.56
WRITES/s: 112419.42
MBREAD/s: 110.38
DAVG/cmd: 0.06
KAVG/cmd: 0.00
GAVG/cmd: 0.06
QAVG/cmd: 0.00
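For anyone comparing their own numbers: as far as I understand esxtop, GAVG is roughly DAVG + KAVG, and QAVG is the queueing part of KAVG, so none of these should be negative and QAVG should not exceed KAVG. Below is a rough Python sketch of those checks run against the two samples above. The `Sample` class and the function names are made up for illustration and do not come from any VMware tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    label: str
    davg: float  # DAVG/cmd: device latency (ms)
    kavg: float  # KAVG/cmd: kernel latency (ms)
    gavg: float  # GAVG/cmd: latency seen by the guest (ms)
    qavg: float  # QAVG/cmd: queue latency (ms), a component of KAVG


def implied_gavg(sample):
    """GAVG should be roughly DAVG + KAVG."""
    return sample.davg + sample.kavg


def check(sample, tolerance=0.05):
    """Return a list of inconsistencies found in one esxtop sample."""
    problems = []
    # A latency can never be negative, so this points at the counters.
    for name in ("davg", "kavg", "gavg", "qavg"):
        if getattr(sample, name) < 0:
            problems.append(f"{name.upper()} is negative")
    # Queue time is part of kernel time, so it cannot be larger.
    if sample.qavg > max(sample.kavg, 0.0) + tolerance:
        problems.append("QAVG exceeds KAVG")
    return problems


# Values copied from the two samples posted above.
idle = Sample("idle VM", 136384.03, -136383.89, 0.014, 142262.59)
load = Sample("Iometer load", 0.06, 0.00, 0.06, 0.00)

for s in (idle, load):
    print(s.label, check(s), f"DAVG+KAVG={implied_gavg(s):.2f}")
```

On the posted numbers, the idle sample fails both checks and the Iometer sample passes. DAVG + KAVG for the idle sample works out to about 0.14 ms, so the two huge values essentially cancel each other out.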
I have opened a ticket with VMware but wanted to ask the community if anyone has seen anything like this.
I was also on the phone with Dell ProSupport for about 3 hours. They could not find anything and wanted me to call VMware.
All drivers and firmware on the storage are up to date.
I will not put this server into production until an answer is found. My fear is that I will move servers over and there will be an issue.
8.0U1 A04 was released also. I updated one of my servers to this version and my issue still persists.
Yeah, I also tried it with no luck.
I tried the latest Broadcom driver from the Broadcom website.
Both ESXi systems crashed with high kernel latency on the Dell H965.
At the moment there is no solution. I tried every driver I could find.
We got latency spikes of 60 to 90 seconds. That means all virtual machines will crash.
Have you opened a case with Dell or VMware?
I would be interested in seeing what they tell you about it.
Sure. They are investigating the log files, but unfortunately there is no solution yet. We are thinking about replacing the PERC12 with a PERC11.
Any fix on your end?
I have contacted Dell about replacing my NVMe setup with a SAS SSD setup.
I will let you know what they say.
Yeah, we replaced the H965i with an H755N RAID controller. After that we also deactivated the cache on the NVMe SSDs. The problem seems to be fixed. The H965i sucks.
Over the last day, latency stayed under 1 ms.
Found out some interesting things!
The same problem with high latency on large block sizes also exists on the H755N with firmware 52.21.1-5149, released on 22.09.23.
On firmware 52.21.0-4606 everything works great! Maybe you can give it a try and downgrade the firmware of the H965i to 8.0.0.0.18-86 too? Maybe Dell has implemented the same fixes for PERC11 and PERC12 in the latest builds, and the same problem occurs on the latest firmware for both controllers.
PERC H965i RAID Controller Firmware Version 8.0.0.0.18-86 | Driver Details | Dell Germany
Thanks. I will communicate this to Dell. It won't let me downgrade the firmware with the packages available.
Hey, why can't you downgrade the firmware?
If you can pass the RAID controller through to a Windows VM, it should work.
The only file format it shows for older firmware is a BIN file, and I have not done an update using a BIN file before.
Here is the Windows Version:
https://dl.dell.com/FOLDER09984834M/1/SAS-RAID_Firmware_75RG7_WN64_8.0.0.0.18-86_A01.EXE
Thanks. I downgraded and it still does the same thing with latency.
I did get this from VMware on my specific issue.
The Engineering team have shared the update that:
“We have been actively debugging this issue, but looks to be a tricky one. We have added debug logs from where the stats are fetched, but we see no anomalies there, yet esxtop reports high and negative stats sometimes. We have not yet root caused the issue, debug is still in progress.”
I will keep you updated with the progress.
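Since engineering mentions that esxtop sometimes reports high and negative stats, here is a rough sketch for catching those samples in a longer capture. It assumes the capture was taken in esxtop batch mode (for example `esxtop -b -d 2 -n 150 > capture.csv`) and that latency columns contain "MilliSec/Command" in their header. That column naming is my assumption from captures I have seen, so adjust the marker if yours differ. The host name `esx01` and adapter `vmhba2` in the sample data are hypothetical.

```python
import csv
import io


def negative_latencies(csv_text, marker="MilliSec/Command"):
    """Return (row_index, column_name, value) for every negative latency cell.

    row_index counts data rows from 0, not including the header row.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Only look at columns whose header looks like a per-command latency.
    latency_cols = [i for i, name in enumerate(header) if marker in name]
    hits = []
    for row_index, row in enumerate(reader):
        for i in latency_cols:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or malformed cells
            if value < 0:
                hits.append((row_index, header[i], value))
    return hits


# Tiny hand-made sample in the assumed batch-mode layout (not real output).
DAVG = r"\\esx01\Physical Disk Adapter(vmhba2)\Average Driver MilliSec/Command"
KAVG = r"\\esx01\Physical Disk Adapter(vmhba2)\Average Kernel MilliSec/Command"
sample = "\n".join([
    f'"timestamp","{DAVG}","{KAVG}"',
    '"10/01/2023 10:00:00","0.06","0.00"',
    '"10/01/2023 10:00:02","136384.03","-136383.89"',
])

for row_index, column, value in negative_latencies(sample):
    print(row_index, column, value)
```

For a real capture, read the file contents and pass them in, e.g. `negative_latencies(open("capture.csv", newline="").read())`. The timestamps of the flagged rows might help line the spikes up with the controller logs.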
Can you confirm this issue occurs with RAID 1 or RAID 10 and not just RAID 5? Everyone I've seen reporting this issue so far was using a RAID 5 configuration. I am looking at buying an R760 and am now considering the PERC 11 version for the time being, though I'm not sure how easy it would be to upgrade in the future.
In my situation it happens on the BOSS Card in a RAID 1 and the NVMe RAID 5.
My personal thought is it is the NVMe part of the setup that is causing the issue but I cannot confirm this until Dell replaces my configuration with a SAS SSD setup or something else.
Dell is currently working on something with it. I will let you know what I find out.
Great, thank you for the information. Can you confirm your power mode is set to Performance in the BIOS and in VMware?
Yes. Both are set. I just received a message from my rep at Dell, who said they are actively working on the issue.
I am also seeing this latency issue. We just implemented 4 R760s running VMware ESXi 8.0.1, build 22088125, Dell custom ISO A04, with vSAN 8 ESA. We use a BOSS card in RAID 1 for the OS and have 6 NVMe drives at 3.49 TB each per host. I see high latency on the storage path. Disk latency looks to be very low, as indicated in this thread. The HBA is reporting very high latency in the hundreds of thousands, as high as 500,000. Looks like I will be transferring 21 TB back to the old system.
Storage path from one of the R760s. All hosts have high numbers.
After I posted this, it occurred to me that we are running vSAN, so we wouldn't need a RAID controller. I went into the inventory, and this is all we have controller-wise:
405-AACD : No Controller
403-BCRU : BOSS-N1 controller card + with 2 M.2 480GB (RAID 1) for the OS.
Any ideas?