VMware Cloud Community
slciec
Enthusiast

Weird disk latency issue on new R760 with onboard storage. Please help.

We purchased a new Dell R760 with 7 onboard NVMe SSDs set up in RAID 5 on a PERC H965i.
The server is set up with the custom Dell ISO of ESXi 8.0.1 build-21813344.
I was setting up a new Windows server to act as a proxy for Veeam and noticed the server hung for around 15 minutes (which might not be related to the issue I am posting about). So I used esxtop to see whether something was going on. Everything looked fine except for storage. The Windows server was generating barely any disk activity, but the latency numbers are a real head-scratcher.
This is one sample; it happens at random with just 1 VM set up.
CMD/s: 626.54
READS/s: 622.37
WRITES/s: 4.16
MBREAD/s: 2.57
DAVG/cmd: 136384.03
KAVG/cmd: -136383.89 (Yes that is negative)
GAVG/cmd: .014
QAVG/cmd: 142262.59

esxtop idle.png

I downloaded Iometer to load-test the storage, and the numbers were what I would expect.
CMD/s: 225445.98
READS/s: 113026.56
WRITES/s: 112419.42
MBREAD/s: 110.38
DAVG/cmd: 0.06
KAVG/cmd: 0.00
GAVG/cmd: 0.06
QAVG/cmd: 0.00

esxtop load.png
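
For anyone who wants to sanity-check samples like the two above: esxtop's latency columns are related (GAVG is DAVG plus KAVG, and QAVG is a component of KAVG), so a reading can be tested for internal consistency. Below is a rough sketch of that check, not an official tool. The names `LatencySample` and `find_inconsistencies` and the rounding tolerance `tol` are my own assumptions.

```python
from dataclasses import dataclass


@dataclass
class LatencySample:
    """One esxtop latency reading, all values in milliseconds per command."""
    davg: float  # device latency
    kavg: float  # kernel latency
    gavg: float  # guest-observed latency
    qavg: float  # queue latency (a component of KAVG)


def find_inconsistencies(s: LatencySample, tol: float = 0.2) -> list[str]:
    """Return reasons why a sample cannot be a real latency breakdown.

    `tol` is an assumed allowance for rounding in the displayed values.
    """
    problems = []
    if min(s.davg, s.kavg, s.gavg, s.qavg) < 0:
        problems.append("negative latency")
    if s.qavg > s.kavg + tol:
        problems.append("QAVG larger than KAVG")
    if abs((s.davg + s.kavg) - s.gavg) > tol:
        problems.append("GAVG != DAVG + KAVG")
    return problems


# The two samples posted above: near-idle VM, then the Iometer load test.
idle = LatencySample(davg=136384.03, kavg=-136383.89, gavg=0.014, qavg=142262.59)
load = LatencySample(davg=0.06, kavg=0.00, gavg=0.06, qavg=0.00)

print(find_inconsistencies(idle))
print(find_inconsistencies(load))
```

The idle sample fails the first two checks while the load sample passes all three. In the idle sample the huge DAVG and the negative KAVG almost cancel out, so GAVG stays tiny, which is at least consistent with the guest itself not seeing that latency.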

I have opened a ticket with VMware but wanted to ask the community whether anyone has seen anything like this.
I was also on the phone with Dell ProSupport for about 3 hours, and they wanted me to call VMware since they could not find anything.
All drivers and firmware on the storage are up to date.
I have not put this server into production and will not until an answer is found. My fear is that I will move servers over and there will be an issue.

 

69 Replies
Tinto1970
Commander

Hi, this sounds really strange, and I don't think there is anything wrong with the hypervisor.

If the machine is not yet in production, you could boot a Linux live CD and test the performance of the virtual volume (I guess you'll need to destroy the current datastore and format it with a filesystem Linux can use).

--
Alessandro aka Tinto VCP-DCV 2023 | VVSPHT 2023 | VMCE 2024 | vExpert 2024 | Veeam Legend
please give me a "Kudo" if you find my answer useful
www.linkedin.com/in/tinivelli
my blog: https://blog.tinivelli.com

michelkeus_stwg
Enthusiast

KAVG/cmd is kernel-related latency, and that negative value definitely does not look right.

If this server is not in production, I'd recommend dissolving the RAID 5 volume and checking whether that fixes the value. I've seen some very weird behavior with RAID volumes. Using the disks individually ("loose") would let you test whether the problem is the controller's configuration or the controller itself.

slciec
Enthusiast

I deleted the RAID 5, and the issue still shows up on the BOSS RAID 1 where ESXi is installed.
It no longer shows the negative value, but it is still concerning.

2023-07-17 08_37_11-esx7.illinoiseyecenter.com - PuTTY.png

 

I have VMware and Dell both working on the issue because I am not sure who is to blame.

Also I have two other identical servers like this one with the same exact issue.

I will let the community know what is found out in case anyone else runs into this issue.

michelkeus_stwg
Enthusiast

I remember that the PERC cards (including the BOSS controllers) may have difficulty sharing IRQs, and that this could be fixed by resetting the configuration on those cards. But it has been a while, and I haven't done that in a long time.

Did support already check the Firmware on the cards? Perhaps (re-)flashing the firmware on the cards will alleviate the problems you are having.

slciec
Enthusiast

Thanks for the suggestion. This morning I reset the BOSS and PERC RAID configurations and recreated the BOSS volume.

The issue is still appearing.

I am going to reload the system again using ESXi 8.0 instead of 8.0U1 to see whether it has something to do with that release.

AnaghB
Enthusiast

Hello @slciec ,

After starting esxtop, press "u" to switch to the disk device view, refresh every 2 seconds, and see whether the DAVG value climbs.

If DAVG climbs for any disk in the "u" view, then the issue is with that disk. If the disk is local, use the iLO/iDRAC/KVM to run extensive diagnostics on the disks.
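
If watching the screen live is impractical, the same check can be done offline. Here is a minimal sketch, assuming the data was captured with esxtop batch mode (for example `esxtop -b -d 2 -n 30 > capture.csv`) and that the DAVG columns end with the counter name "Average Driver MilliSec/Command". The counter name, the file name, and the helper `peak_latency` are assumptions to verify against your own capture.

```python
import csv
import io


def peak_latency(csv_text: str,
                 counter: str = "Average Driver MilliSec/Command") -> dict[str, float]:
    """Return the highest value seen in every column whose name ends with `counter`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if name.endswith(counter)]
    peaks: dict[str, float] = {}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (IndexError, ValueError):
                continue  # skip short rows and non-numeric cells
            name = header[i]
            peaks[name] = max(peaks.get(name, value), value)
    return peaks


# Tiny made-up capture (hypothetical host and adapter names) to show the shape.
sample = (
    '"Time","\\\\esx\\Physical Disk Adapter(vmhba0)\\Average Driver MilliSec/Command"\n'
    '"07/17/2023 08:37:11","0.06"\n'
    '"07/17/2023 08:37:13","136384.03"\n'
)
print(peak_latency(sample))
```

For a real capture, read the file with `open("capture.csv", newline="").read()` and pass the text in; any column whose peak is far above the rest is the one to look at.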

Anagh B
VCIX-DCV6.5, VSAN Specialist
Please mark as helpful or correct if my answer is useful to you

slciec
Enthusiast

I just checked that, and the DAVG is pretty much zero when it happens.

esxtopu.png

esxtopd.png

AnaghB
Enthusiast

Hello @slciec ,

The screenshot you shared of the local disk latency shows latency near 0. This means the disk is performing well without any issues.

The weird values shown for the HBA are a glitch. To address it, do two things:

1. Reboot the ESXi host and see whether the issue appears again.

2. Upgrade the driver and firmware to the latest compatible version to confirm that the HBA stats are clean.

 

slciec
Enthusiast

This is what support has told me about this.

"The abnormal latency reported in the esxtop values is being investigated by the Engineering team.
I believe this behavior to be a cosmetic one, as there are no other issues reported on the host; however, we will not be able to confirm this until we have confirmation and a further action plan from the Engineering team."

 

AnaghB
Enthusiast

Hello @slciec ,

As I mentioned, it's a cosmetic issue and there is no latency. There are 3 possible plans of action that the support team might suggest:

1. Reboot the ESXi host and see whether the issue is still observed.

2. Upgrade the HBA driver and firmware to the latest compatible version per the VMware HCL.

3. Upgrade ESXi to the next build.

This is not a real issue, and you can continue using the host in production.

 

slciec
Enthusiast

Reboot does not fix the issue.

All HBA drivers and firmware have already been applied. This was done when I opened a ticket with Dell Support.

I installed multiple versions of ESXi; I even went back to version 7.0U3n, which the HCL lists as supported for this server.

On all versions it happened.

I was just updating the community in case anyone else has this issue.

If I get any more information, I will pass it along.

Thanks for responding.

Chok45
Contributor

We have exactly the same issue with 8x Dell R760 servers and the Dell PERC12 H965i RAID controller: extremely high latencies on the local RAID with 6x NVMe SSDs. We tried different VMware versions with the Dell customized image, from 7 to 8 and so on. It seems there is a firmware or driver problem with ESXi. At the moment these systems are useless because performance is slower than with SATA drives.

Chok45
Contributor

I can confirm that this is not cosmetic. On snapshot removal, latencies jump to 400 ms and higher. 6x NVMe SSDs in RAID 5.

slciec
Enthusiast

This is the last email I received from tech support. I am going to respond letting them know another customer is having the same issue.

It looks like it's an intermittent issue on 8.0.1 and wasn't seen on the latest main, but since it was still seen in 8.0.1, we are now trying to root-cause the issue on 8.0.1.
As per the current investigation, the following are the observations.

1.) These values go out of range on 8.0.1 and not on main (we are yet to confirm this with adequate experimentation)
2.) It is seen on large IO sizes usually around 4M or higher.
3.) On lower-size IOs we don't see this issue and values are just fine in that case.
4.) This issue seems to occur only in nvme case not in scsi devices.


With that said we are still investigating the root cause of the issue.
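
To compare small and large IO from inside a guest while watching esxtop, something like the following could serve as a rough load generator. It is only a sketch: `timed_writes`, the block counts, and the temp-file location are hypothetical choices of mine, and the guest OS, the virtual disk layer, or the controller may split or merge requests, so a 4 MiB write here is not guaranteed to reach the device as a single 4 MiB IO.

```python
import os
import tempfile
import time


def timed_writes(path: str, block_size: int, count: int) -> tuple[int, float]:
    """Write `count` blocks of `block_size` bytes, forcing each one to disk.

    Returns (total bytes written, elapsed seconds).
    """
    block = os.urandom(block_size)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push the block past the OS cache
    return block_size * count, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "io_test.bin")
        for size in (4 * 1024, 4 * 1024 * 1024):  # 4 KiB vs 4 MiB blocks
            total, secs = timed_writes(target, size, count=16)
            print(f"{size:>8} B blocks: {total} bytes in {secs:.3f} s")
```

To exercise a specific datastore, point `target` at a file on the virtual disk that lives there instead of the temp directory.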

Chok45
Contributor

Holy **bleep**. That explains why the latency jumps when a backup is running.

slciec
Enthusiast

Also, I'm not sure about your configuration, but I also ran into the issue linked below, with error messages, on my R760s.

But it was easily solved.

Re: Failed to cleanup registration key on volume - VMware Technology Network VMTN

Chok45
Contributor

Hmm, we use the BOSS card with VMFS for an HCI deployment with StarWind VSAN. I cannot delete the partition because the OS of the StarWind appliance resides on it. But we have no issues with the BOSS card, only with the PERC12 H965i.

Chok45
Contributor

Dell released a new ESXi image for vSphere 7 yesterday:

https://www.dell.com/support/home/de-de/drivers/driversdetails?driverid=x3djh&oscode=xi70&productcod...

There is also a new driver for the PERC RAID controller, which is not the native one. I will give it a try.

Chok45
Contributor

I will also try the Dell customized image A12. It has bcm_mpi3 version 8.1.1.0.0.0-1OEM integrated.

https://www.dell.com/support/home/de-de/drivers/driversdetails?driverid=pk7wn&oscode=xi70&productcod...
