VMware Cloud Community
slciec
Enthusiast

Weird disk latency issue on new R760 with onboard storage. Please help.

We purchased a new Dell R760 with 7 onboard NVMe SSDs set up in a RAID 5 on a PERC H965i.
The server is set up with the custom Dell ISO of ESXi 8.0.1, build 21813344.
I was setting up a new Windows server to act as a proxy for Veeam and noticed the server just hung for around 15 minutes (which might not be related to the issue I am posting about). So I went to see what was going on using esxtop. Everything was fine except the storage area. There was barely any disk activity from the Windows server, but the latency numbers are a real head-scratcher.
This is one sampling, and this just happens randomly with one VM set up.
CMD/s: 626.54
READS/s: 622.37
WRITES/s: 4.16
MBREAD/s: 2.57
DAVG/cmd: 136384.03
KAVG/cmd: -136383.89 (Yes that is negative)
GAVG/cmd: .014
QAVG/cmd: 142262.59
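As I understand the esxtop counters (my reading, not an official statement), GAVG = DAVG + KAVG, so a negative KAVG just means DAVG overshot GAVG, which hints that the DAVG counter itself is bogus rather than real device latency. Sanity-checking with the numbers above:

```shell
# esxtop derives KAVG as GAVG - DAVG, so GAVG should equal DAVG + KAVG.
# Reconciling the reported DAVG and KAVG gives a sub-millisecond result,
# i.e. the huge DAVG/KAVG pair cancels out to a sane guest latency.
awk 'BEGIN { printf "implied GAVG = %.2f ms\n", 136384.03 + (-136383.89) }'
# prints: implied GAVG = 0.14 ms
```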

esxtop idle.png

I downloaded IOMeter to do a load test of the storage, and the numbers showed what I would expect.
CMD/s: 225445.98
READS/s: 113026.56
WRITES/s: 112419.42
MBREAD/s: 110.38
DAVG/cmd: 0.06
KAVG/cmd: 0.00
GAVG/cmd: 0.06
QAVG/cmd: 0.00

esxtop load.png

I have opened a ticket with VMware but wanted to ask the community if anyone has seen anything like this.
I was also on the phone with Dell Pro Support for about 3 hours, and they wanted me to call VMware since they could not find anything.
All drivers and firmware on the storage are up to date.
I have not put this server into production, and won't until an answer can be found. My fear is I will move servers over and there will be an issue.

 

57 Replies
Chok45
Contributor

Hey Guys,

 

We found out some very interesting things.

 

First one: Abnormally high latency inside esxtop is a known issue, and Dell has updated the Known Issues section:

 

VMware vSphere ESXi 8.x on Dell PowerEdge Systems Release Notes | Dell Deutschland

 

Second one: We also had real latency issues with the H755N AND H965i RAID controllers inside 2x R760xs systems.

 

After hours of troubleshooting, we connected the RAID cards to a different SL connector on the server mainboard.

After that, all latency issues were gone!! If we connect them back, restart the host, and run copy tests, latency jumps back to 300-1000 ms. Really, really strange. We can replicate this issue on other Dell R760xs systems too.

The really strange thing is that this issue also occurs on the R760, which has a different mainboard design. We have 6x Dell R760 with dual H965i RAID controllers, and one of the RAID cards per server has the same problem, on every system: one RAID card works fine, one has latency issues. On R760 dual-RAID configs it is not easy to connect the cards to another connector on the mainboard because both connectors are in use. On R760xs single configs this is possible.

 

In the factory config of the R760xs, the RAID cards were connected to SL5_CPU1_PA3 on the mainboard. We connected them to SL4_CPU2_PA2, and then latency was great. The problem exists with different RAID cards, so it cannot be the cards themselves.

 

Chok45
Contributor

We also did a downgrade to vSphere 7 because on 8 we had problems with the iSCSI iSER adapter, which was lost after a reboot.

slciec
Enthusiast

So are you thinking it has something to do with the mainboard?

I was finally connected to a couple of Dell senior support engineers, and they had me download an SLI (Support Live Image) ISO to test the setup using fio and iostat.

I broke the RAID and made all the NVMe drives non-RAID, then I recreated the RAID and ran the tests.

Everything was sent to Dell this morning, so I will have to wait and see what they say.

I have also told them to read through this thread to see what everyone else is saying.

I hope to get a solution at some point from Dell or VMware, because I have a bunch of new equipment that is just bricks right now.

Chok45
Contributor

Yeah, it could be. At the moment our systems have been running absolutely fine, without latency issues, since we changed the mainboard connector from SL5_CPU1_PA3 to SL4_CPU2_PA2.


Chok45
Contributor

Yeah, I can absolutely understand you. We are in the same boat. We had 6x R760 servers, which Dell will replace with systems with a single H755 RAID controller and SAS SSDs (instead of NVMe). Hopefully these systems will run better.

Chok45
Contributor

Do you have a single-RAID-controller config? If yes, you can try to attach the data cable to the second connector on the mainboard and check if the latency problems still exist. Be careful: you have to clear the config in iDRAC because of validation errors after you change the port.

slciec
Enthusiast

Each server has dual H965i controllers, but the drives are only connected to one of the two RAID cards.

Once Dell support gets back to me, I might try doing that.

Chok45
Contributor

Okay, then you can try to attach 3 drives to the controller 1 backplane and 3 drives to the controller 2 backplane.

Create a RAID 5 and datastores within VMware and check performance. Maybe you will see that the second controller has no problems. Then you have exactly the same problem we have.

 

Controller 1 (left) is connected to SL1_CPU2_PA1

Controller 2 (right) is connected to SL3_CPU1_PA2

Regardless of which controller is connected to SL3_CPU1_PA2, that one has latency issues under VMware (local) AND via PCIe passthrough.

 

Chok45
Contributor

Did you run the fio test with big block sizes (>= 64 KB)?

 

We found out that the latency problem only occurs with big block sizes.

slciec
Enthusiast

128k. This is the configuration I used for the test.

[global]
rw=write
numjobs=1
iodepth=128
ioengine=libaio
time_based
runtime=600
bs=128k
direct=1

Chok45
Contributor

I think a 128k block size is too small to reproduce the issue.

Can you please check the results with 256K, 512K, and 1MB block sizes?

We saw the problem with block sizes bigger than 512K.
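Something like this sweep would cover all three sizes with the same parameters as the job file posted above (an untested sketch; `/path/to/testfile` is a placeholder for a file on the storage under test, and the `echo` is there so you can review the commands before running them):

```shell
#!/bin/sh
# Sweep larger block sizes with the same fio parameters used earlier in
# the thread. TARGET is a placeholder -- point it at a file on the
# storage under test. Remove "echo" to actually run fio.
TARGET=/path/to/testfile
for bs in 256k 512k 1m; do
  echo fio --name="seqwrite-$bs" --rw=write --bs="$bs" --numjobs=1 \
       --iodepth=128 --ioengine=libaio --direct=1 --time_based \
       --runtime=600 --filename="$TARGET"
done
```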

slciec
Enthusiast

This is on a RAID 5 with 7 drives. Block size 512k:

slciec_0-1696861985015.png

 

Block size 1024k

slciec_1-1696862008814.png

 

adevereaux
Contributor

So, unless I am mistaken, those results look good? Which backplane connector were those runs plugged into?

 

What does it look like if you set numjobs= to a higher value?

slciec
Enthusiast

I increased the jobs to 16; this is what I got back.

slciec_0-1696882522992.png

I also ran this to measure random read/write performance.

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=sbd --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

Results

slciec_1-1696883647370.png

 

slciec_2-1696883654692.png

The numbers look fine to me, unless I am reading something wrong.

Also, I would note this test was not done with ESXi installed; I am using a Dell image running Rocky Linux 8.8 with all the appropriate drivers and test utilities on it.

After running all these tests, I get the feeling this is not so much hardware related as it is ESXi related, with drivers or something.

 

Chok45
Contributor

Yeah, I absolutely agree that it could be a problem with ESXi on PowerEdge x60 systems. We didn't see the latency problems when Microsoft Windows with the latest drivers was installed locally. The problem only occurred on VMware ESXi, with or without PCIe passthrough. Maybe there is something wrong with the PCIe bus or something else. With VMware, the problem exists only on one connector on the mainboard.

slciec
Enthusiast

Dell got back to me and basically told me what we already knew.

"The hardware is performing as expected. While in the Support Live Image, all the drives observes extremely low latency times on the tests you performed, and the overall performance was very good and pretty consistent. All of this does point toward ESXi/VMware being the bottleneck, unfortunately."

I updated my ticket with VMware, but if I don't hear back from them, I am unsure what to do next.

 

slciec
Enthusiast

VMware got back to me and told me this is a cosmetic issue and the numbers I am seeing are wrong.

slciec_0-1697562880580.png

 

So in my case this is probably true, since I don't see any issues when running load tests on the virtual machine.

 

Chok45
Contributor

Hey,

as I said, the high latency values in esxtop are a bug and a cosmetic thing in VMware vSphere.

But we saw real latency issues in our environment inside Windows VMs with PCIe passthrough enabled. On a local file copy, latency jumped to 60 seconds and more, and datastores would crash.

We can replicate the issue without StarWind VSAN installed.
