VMware Cloud Community
slciec
Enthusiast

Weird disk latency issue on new R760 with onboard storage. Please help.

We purchased new Dell R760 servers with 7 onboard NVMe SSDs set up in RAID 5 on a PERC H965i.
The servers are set up with the custom Dell ISO of ESXi 8.0.1 build-21813344.
I was setting up a new Windows server to act as a proxy for Veeam and noticed the server hung for around 15 minutes (which might not be related to the issue I am posting about). So I used esxtop to see if something was going on. Everything was fine except for the storage area. The Windows server was generating barely any disk activity, but the latency numbers are a real head-scratcher.
This is one sample; it happens at random with just one VM set up.
CMD/s: 626.54
READS/s: 622.37
WRITES/s: 4.16
MBREAD/s: 2.57
DAVG/cmd: 136384.03
KAVG/cmd: -136383.89 (Yes that is negative)
GAVG/cmd: .014
QAVG/cmd: 142262.59

esxtop idle.png

I downloaded Iometer to load-test the storage, and the numbers showed what I would expect.
CMD/s: 225445.98
READS/s: 113026.56
WRITES/s: 112419.42
MBREAD/s: 110.38
DAVG/cmd: 0.06
KAVG/cmd: 0.00
GAVG/cmd: 0.06
QAVG/cmd: 0.00
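
For anyone comparing their own esxtop output against the two samples above: guest latency in esxtop is normally the sum of device and kernel latency (GAVG = DAVG + KAVG), so a negative KAVG, or a GAVG that does not match the sum, points at a reporting problem rather than real I/O delay. Below is a minimal Python sketch of that sanity check. The function name `check_sample` and the 0.01 ms tolerance are assumptions made for illustration, not anything taken from esxtop itself.

```python
import math

def check_sample(davg, kavg, gavg, tol=0.01):
    """Return a list of warnings for one esxtop latency sample (values in ms).

    check_sample and tol are hypothetical names chosen for this sketch.
    """
    warnings = []
    # Latency averages should never be negative.
    for name, value in (("DAVG", davg), ("KAVG", kavg), ("GAVG", gavg)):
        if value < 0:
            warnings.append(f"{name} is negative ({value})")
    # GAVG is expected to equal DAVG + KAVG, within a small tolerance.
    expected = davg + kavg
    if not math.isclose(expected, gavg, abs_tol=tol):
        warnings.append(f"GAVG {gavg} != DAVG + KAVG = {expected:.2f}")
    return warnings

# The two samples posted above.
idle = check_sample(davg=136384.03, kavg=-136383.89, gavg=0.014)
load = check_sample(davg=0.06, kavg=0.00, gavg=0.06)

for label, result in (("idle", idle), ("load", load)):
    print(label, result if result else "looks consistent")
```

With the posted numbers, the idle sample is flagged twice (KAVG is negative, and DAVG + KAVG works out to about 0.14 rather than the posted .014), while the load sample passes cleanly.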

esxtop load.png

I have opened a ticket with VMware but wanted to ask the community if anyone has seen anything like this.
Also, I was on the phone with Dell ProSupport for about 3 hours, and they wanted me to call VMware since they could not find anything.
All drivers and firmware on the storage are up to date.
I have not put this server into production and will not until an answer is found. My fear is that I will move servers over and there will be an issue.

 

69 Replies
slciec
Enthusiast

8.0U1 A04 has also been released. I updated one of my servers to this version and my issue still persists.

 

Chok45
Contributor

Yeah, I also tried it, with no luck.

Chok45
Contributor

Tried the latest Broadcom driver from the Broadcom website.

Both ESXi systems crashed with high kernel latency on the Dell H965.

 

Chok45
Contributor

At the moment there is no solution. I tried every driver I could find.

We got latency spikes of 60 to 90 seconds. This means all virtual machines will crash.

 

 

slciec
Enthusiast

Have you opened a case with Dell or VMware?

I would be interested in seeing what they tell you about it.

Chok45
Contributor

Sure. They are investigating the log files, but unfortunately there is no solution yet. We are thinking about replacing the PERC12 with a PERC11.

slciec
Enthusiast

Any fix on your end?

I have contacted Dell about replacing my NVMe setup with a SAS SSD setup.

I will let you know what they say.

Chok45
Contributor

Yeah, we replaced the H965i with an H755N RAID controller. After that we also deactivated the cache of the NVMe SSDs. The problem seems to be fixed. The H965i sucks.

Over the last day, latency was under 1 ms.

 

Chok45
Contributor

Found out some interesting things!

The same problem with high latency on large block sizes also exists on the H755N with firmware 52.21.1-5149, released on 22.09.23.

On firmware 52.21.0-4606 everything works great! Maybe you can give it a try and downgrade the H965i firmware to 8.0.0.0.18-86 too? Maybe Dell has implemented fixes for PERC11 and PERC12 in the latest builds, and the same problem occurs on the latest firmware for both controllers.

PERC H965i RAID Controller Firmware Version 8.0.0.0.18-86 | Driver Details | Dell Germany

slciec
Enthusiast

Thanks. I will communicate this to Dell. It won't let me downgrade the firmware with the packages available.

 

Chok45
Contributor

Hey, why can't you downgrade the firmware?

If you can pass the RAID controller through to a Windows VM, it should work.

 

slciec
Enthusiast

The only file format it shows for older firmware is a BIN file, and I have not done an update using a BIN file before.

 

 

Chok45
Contributor

slciec
Enthusiast

Thanks. I downgraded and it still does the same thing with latency.

I did get this from VMware on my specific issue.

The Engineering team have shared the update that:

“We have been actively debugging this issue, but looks to be a tricky one. We have added debug logs from where the stats are fetched, but we see no anomalies there, yet esxtop reports high and negative stats sometimes. We have not yet root caused the issue, debug is still in progress.”

I will keep you updated with the progress.
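
Since the bad values come and go, one way to catch them is to capture esxtop in batch mode (`esxtop -b` writes CSV) and scan the capture offline. The Python sketch below is only an illustration. It assumes latency columns can be identified by the substring `MilliSec/Command` in the header, and the sample header names, timestamps, and 1000 ms threshold are made up for the example; the two latency pairs are the ones posted at the top of this thread.

```python
import csv
import io

def find_suspect_latencies(csv_text, threshold_ms=1000.0):
    """Return (timestamp, column, value) for negative or implausibly high latencies.

    Assumption: latency columns contain "MilliSec/Command" in their header
    and the first column holds the sample timestamp.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    latency_cols = [i for i, name in enumerate(header) if "MilliSec/Command" in name]
    hits = []
    for row in reader:
        for i in latency_cols:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or malformed cells
            if value < 0 or value > threshold_ms:
                hits.append((row[0], header[i], value))
    return hits

# Hypothetical capture: column names and timestamps are invented for this sketch.
sample = (
    '"Timestamp","vmhba0 Average Driver MilliSec/Command","vmhba0 Average Kernel MilliSec/Command"\n'
    '"sample-1","136384.03","-136383.89"\n'
    '"sample-2","0.06","0.00"\n'
)

hits = find_suspect_latencies(sample)
for timestamp, column, value in hits:
    print(timestamp, column, value)
```

Only the first sample is reported here, once for the huge driver latency and once for the negative kernel latency. Against a real capture, the header filter and threshold would need adjusting to whatever the file actually contains.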

adevereaux
Contributor

Can you confirm this issue occurs with RAID 1 or RAID 10 and not just RAID 5? Everyone I've seen reporting this issue so far was using a RAID 5 configuration. I'm looking at buying an R760 and am now considering picking up the PERC 11 version for now, though I'm not sure how easy it would be to upgrade in the future.

slciec
Enthusiast

In my situation it happens on both the BOSS card in RAID 1 and the NVMe RAID 5.

My personal thought is that the NVMe part of the setup is causing the issue, but I cannot confirm this until Dell replaces my configuration with a SAS SSD setup or something else.

Dell is currently working on something for it. I will let you know what I find out.

 

adevereaux
Contributor

Great, thank you for the information. Can you confirm your power mode is set to Performance in the BIOS and in VMware?

slciec
Enthusiast

Yes. Both are set. I just received a message from my rep at Dell, who said they are actively working on the issue.

ViSioN0101
Contributor

I am also seeing this latency issue. We just implemented 4 R760s running VMware ESXi 8.0.1, build 22088125 (Dell custom ISO A04), with vSAN 8 ESA. We use BOSS in RAID 1 for the OS and have 6 NVMe drives at 3.49 TB per host. I see high latency on the storage path. Disk latency looks to be very low, as indicated in this thread. The HBA is reporting very high latency in the hundreds of thousands, as high as 500,000. Looks like I will be transferring 21 TB back to the old system.

 

Storage paths from one of the R760s. All hosts have high numbers.

Storage path | Metric | Rollup | Unit | Reported values
pcie.b100-pcie.0:0-eui.36563130575165740025384300000003 | Read latency | Average | ms | 238,185 262,582 23,730 238,283.28
pcie.c400-pcie.0:0-eui.36563130575163600025384300000003 | Read latency | Average | ms | 44,618 2,241.194
pcie.100-pcie.0:0-t10.NVMe____Dell_BOSS2DN1____________________________0100000992435000 | Read latency | Average | ms | 29,998 853,652 0 47,928.68
pcie.b000-pcie.0:0-eui.36563130575163640025384300000003 | Read latency | Average | ms | 42,699 1,616.539
pcie.c300-pcie.0:0-eui.36563130575165750025384300000003 | Read latency | Average | ms | 16,502 500,096 21,591.111
pcie.ae00-pcie.0:0-eui.36563130575165720025384300000003 | Read latency | Average | ms | 49,460 1,970.4
usb.vmhba32-usb.0:0-mpx.vmhba32:C0:T0:L0 | Read latency | Average | ms | 0
pcie.af00-pcie.0:0-eui.36563130575165770025384300000003 | Read latency | Average | ms | 52,370 2,486.539
pcie.af00-pcie.0:0-eui.36563130575165770025384300000003 | Write latency | Average | ms | 5,761 491,199 1,674 34,693.3
pcie.b000-pcie.0:0-eui.36563130575163640025384300000003 | Write latency | Average | ms | 144,727 7,676.672
usb.vmhba32-usb.0:0-mpx.vmhba32:C0:T0:L0 | Write latency | Average | ms | 0
pcie.b100-pcie.0:0-eui.36563130575165740025384300000003 | Write latency | Average | ms | 1,039 214,675 4,763.828
pcie.c300-pcie.0:0-eui.36563130575165750025384300000003 | Write latency | Average | ms | 1,031 249,404 310 37,498.707
pcie.100-pcie.0:0-t10.NVMe____Dell_BOSS2DN1____________________________0100000992435000 | Write latency | Average | ms | 26,136 1,393.111
pcie.ae00-pcie.0:0-eui.36563130575165720025384300000003 | Write latency | Average | ms | 19,114 496,225 31,999.482
pcie.c400-pcie.0:0-eui.36563130575163600025384300000003 | Write latency | Average | ms | 135,047 6,945.561
ViSioN0101
Contributor

After I posted this, it occurred to me that we are running vSAN, so we wouldn't need a RAID controller. I went into inventory, and this is all we have controller-wise:

405-AACD : No Controller

403-BCRU : BOSS-N1 controller card + with 2 M.2 480GB (RAID 1) for the OS.

Any ideas?
