VMware Cloud Community
andyarnet
Contributor

Intel DC P3700 Firmware

We are attempting to build a new 4-node VSAN hybrid cluster using Intel P3700 SSDs.  After getting everything together, the health checks fail saying the driver is not on the HCL, even though we are using the correct driver.  Opened an SR, and support is telling me that the problem is actually the firmware on the SSDs.  The drives shipped with FW0171, but FW0131 is required.  We are trying to get a response from Intel to see if the firmware can be downgraded, but so far, no luck.

As far as I can tell, the Intel P3700 and P3600 are the only 2.5" NVMe drives on the HCL, and unless you have old drives, they can't be used.  Has anyone had any experience with this drive in VSAN? 

34 Replies
elerium
Hot Shot

I've always had the Intel P3700 give a warning on healthcheck for the HCL category since VSAN 6. I more or less ignored that since I know the part really is on HCL.

I use these models of P3700 (listed with VID:DID and SVID:SSID):

Intel DC P3700 400GB HHHL, SSDPEDMD400G4, 8086:0953 8086:3702

Intel DC P3700 1.6TB HHHL, SSDPEDMD016T4, 8086:0953 8086:3702

Intel DC P3700 2.0TB HHHL, SSDPEDMD020T4, 8086:0953 8086:3702

Looking at another PCI ID database (https://pci-ids.ucw.cz/read/PC/8086/0953), maybe SSID 3703 refers only to the 2.5" SFF version and the HHHL version is SSID 3702? If so, the HCL has the SSID for the HHHL version entered incorrectly.
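To see which subsystem ID a card actually reports, the ID pairs can be read on the host and compared with the values listed above. A minimal sketch, assuming `vmkchdev -l` prints lines in the form `<PCI address> <VID:DID> <SVID:SSID> <owner> <device name>`. The helper name `classify_p3700` is made up for illustration, and the 3702 = HHHL / 3703 = 2.5" SFF mapping is only the guess from this post, not something the HCL confirms:

```shell
# Hypothetical helper: reads `vmkchdev -l`-style lines on stdin and labels
# Intel 8086:0953 devices by subsystem ID.
# Assumed line layout: <pci-addr> <VID:DID> <SVID:SSID> <owner> <name>
classify_p3700() {
    while read -r addr viddid svidssid owner name; do
        # Skip anything that is not the Intel NVMe controller 8086:0953
        [ "$viddid" = "8086:0953" ] || continue
        case "$svidssid" in
            8086:3702) form="HHHL (guess)" ;;
            8086:3703) form="2.5in SFF (guess)" ;;
            *)         form="unknown" ;;
        esac
        echo "$name $svidssid $form"
    done
}
```

On a host this would be run as `vmkchdev -l | classify_p3700`.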

eode
Enthusiast

Yes, those are my thoughts as well. I've noted this in my SR with VMware. With luck, the VSAN HCL "DB" (JSON file) will be corrected (unless I'm misunderstanding the logic of the HCL check).

eode
Enthusiast

Quick update, regarding VSAN HCL DB:

Got confirmed in our SR that the "Warning on the P3700 HHHL AIC" was due to a VSAN Health-plugin issue, and could be ignored (will be fixed in the following health releases).

Regarding firmware-version:

We were also told to downgrade the stock firmware (8DV10171) to 8DV10131. Hopefully the VSAN HCL DB will be updated shortly (based on Intel's response from yesterday regarding the *171 firmware, it should be verified and added by VMware next week). I'm also wondering which driver VMware will recommend with the new firmware.
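The firmware is written in this thread both in short form (FW0171) and in full (8DV10171), so a quick check against whatever revisions the HCL lists is easier if only the last four characters are compared. A minimal sketch: `fw_suffix` and `fw_accepted` are hypothetical helper names, and the accepted revisions passed in are example input that should be copied from the current HCL entry:

```shell
# Last four characters of a firmware revision string,
# e.g. 8DV10171 -> 0171, FW0171 -> 0171.
fw_suffix() {
    printf '%s' "$1" | tail -c 4
}

# Usage: fw_accepted <drive-revision> <accepted-revision>...
# Returns 0 if the drive's suffix matches any accepted revision's suffix.
fw_accepted() {
    want=$(fw_suffix "$1")
    shift
    for ok in "$@"; do
        [ "$(fw_suffix "$ok")" = "$want" ] && return 0
    done
    return 1
}
```

For example, `fw_accepted FW0171 8DV10131 && echo listed || echo "not listed"`. Suffix matching is only a convenience; it would not tell apart two firmware families that happen to share their last four characters.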

ailark
Contributor

Any update? I have the exact same problem with Intel P3700 SSDs on the *171 firmware and VSAN.

andyarnet
Contributor

Great news, just got word from our vendor that the P3700 w/ FW0171 is now certified for VSAN:

VMware Compatibility Guide - ssd

We've been using this with the 1.2.0.27-4vmw driver for a while with no problems, but I know Elerium was reporting some write speed limitations with checksum enabled using the 1.2.0.27-4vmw driver.  I see that the intel-nvme 1.0e.2.0 driver is also listed on the HCL, I'm curious if there might be an improvement using this driver.

elerium
Hot Shot

I've tried all the available drivers (intel-nvme 1.0e.1.1, intel-nvme 1.0e.2.0, nvme 1.0e.0.35-1vmw (inbox) and nvme 1.2.0.27-4vmw), and all of them perform poorly to varying degrees on VSAN 6.2 if you leave checksum enabled. If you MUST use checksum, use nvme 1.2.0.27-4vmw; the two intel-nvme drivers are pretty much unusable with checksum enabled. Using nvme 1.2.0.27-4vmw with checksum on will work fine for most workloads that don't involve constant large sequential writes. However, if your VSAN goes into resync in this setup, you can expect serious performance problems and horrible latencies (500ms+) during resync. I have 4 different clusters, hybrid and AF, all using Intel P3600 or P3700, where I have been able to reproduce this behavior consistently. I don't have any other NVMe SSDs to test with, so it is still possible that something else is the cause, but my best guess is that it's the nvme driver, the SSD firmware, or maybe a bug in the checksum implementation.
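Since behaviour differs this much between drivers, it is worth confirming which NVMe VIB a host actually has installed before comparing results. A minimal sketch, assuming the usual `esxcli software vib list` column layout (name first, version second); `nvme_vibs` is a hypothetical helper name:

```shell
# Hypothetical helper: filters `esxcli software vib list`-style output on
# stdin down to NVMe driver VIBs, printing "name version" per line.
# Assumes column 1 is the VIB name and column 2 is its version.
nvme_vibs() {
    awk 'tolower($1) ~ /nvme/ { print $1, $2 }'
}
```

On a host this would be run as `esxcli software vib list | nvme_vibs`.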

If you are using P3700 or P3600 and VSAN 6.2, I would recommend disabling checksum in all your storage policies to avoid the problems above. If you're using VSAN 6.1 or VSAN 6.0 (which doesn't support checksum), you won't see any issues whatsoever. I still have an SR open with VMware about the checksum issue, their development is still looking at it and there's no ETA for a fix yet.

I raised this issue with Intel; you can read the thread here: Firmware Downgrade | Intel Communities. Their reply is below:

We have already worked with VMware* to have our FW171 added into the HCL and the expectation is for VMWare* to have it updated by next week, you can keep an eye on their website.

On the other hand, the FW added into the HCL has no relationship with the SW Checksum added by VMWare* (as an option) to their VSAN 6.2; therefore, the latest FW will not fix the latency issues associated with the SW Checksum when enabled, as this is not really related to our drives, since our drives were designed and tested for high integrity and therefore not validated or intended to operate with VSAN's SW Checksum feature.

At the end it is up to everyone whether to rely on our high integrity SSD's or enable a SW Checksum which will add latency and therefore sacrifice performance.

Let us know if you need more information.

m0ps
Enthusiast

P3700 HHHL AIC is on HCL now
VMware Compatibility Guide - vfrc

best regards, m0ps

elerium
Hot Shot

After opening an unrelated case regarding poor write latency and failed drives and being given a command to change LSOM congestion limits by support, I've found that the following settings (run on each host) increased overall write performance for me by at least 50% (my resync speeds are now 100% faster too)!

esxcfg-advcfg -s 24 /LSOM/lsomLogCongestionLowLimitGB

esxcfg-advcfg -s 48 /LSOM/lsomLogCongestionHighLimitGB

(defaults are 16 and 24 respectively if you need to revert)

From what I have figured out, the values above configure the minimum and maximum write log size. They directly determine congestion levels (congestion starts when the LowLimitGB value is reached, and the HighLimitGB value sets the max congestion level). The write log is stored on the cache SSD and acts as a buffer, storing commands that will be written to the capacity disks. I have found that increasing this buffer dramatically increases performance when using P3700 + magnetic capacity disks. In IOPS benchmark testing I see a 50% write improvement, and in resync operations I see a 100% throughput improvement! I also previously had to disable checksum, because leaving it enabled gave me poor write latencies while VSAN was performing a resync; this change has fixed that as well, and I can finally enable checksum!

As far as I can tell, there are two downsides. First, this buffer increases capacity use on your cache SSD (by the GB amount specified); I think this drawback is minimal unless you are using a tiny cache SSD. Second, if you increase the values too much (for me it was past 64GB for the HighLimit), VSAN won't throttle VM latencies, so it will put higher I/O strain on the capacity layer, which may ultimately affect VM latency. You should do your own testing to see what works in your environment, but I wanted to put this out there as I practically got a free 50-100% write boost by changing these settings.
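The two commands above can be wrapped so that the values are sanity-checked and the revert is one call away. A minimal sketch that only prints the `esxcfg-advcfg` commands rather than running them. `lsom_cmds` is a hypothetical helper; the 16/24 defaults and the 64 GB ceiling are taken from this post (the ceiling is one poster's observation, not a documented limit), and the low-below-high check is an assumed sanity rule:

```shell
# Hypothetical helper: prints the esxcfg-advcfg commands for a given
# low/high log congestion limit pair (in GB) without executing them.
# Usage: lsom_cmds <low-gb> <high-gb>   or   lsom_cmds default
lsom_cmds() {
    if [ "$1" = "default" ]; then
        set -- 16 24              # defaults quoted in the post
    fi
    if [ $# -ne 2 ]; then
        echo "usage: lsom_cmds <low-gb> <high-gb> | default" >&2
        return 1
    fi
    low=$1
    high=$2
    # Assumed sanity rule: the low limit must stay below the high limit
    if [ "$low" -ge "$high" ]; then
        echo "low limit must be below high limit" >&2
        return 1
    fi
    # Past 64 GB the poster saw VM latency suffer; warn but still print
    if [ "$high" -gt 64 ]; then
        echo "warning: high limit above 64 GB" >&2
    fi
    echo "esxcfg-advcfg -s $low /LSOM/lsomLogCongestionLowLimitGB"
    echo "esxcfg-advcfg -s $high /LSOM/lsomLogCongestionHighLimitGB"
}
```

Review the output of `lsom_cmds 24 48` first, then pipe it to `sh` on each host to apply it; `lsom_cmds default | sh` would put the defaults back. `esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB` shows the value currently in effect.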

C3LLC
Contributor

Espen-

Did you ever get this figured out?   We are still seeing health warnings and have the 3702 SSID listed as well.

Thoughts?

Rick

stubbedo
Contributor

We have the exact same issue: P3700, same driver (1.2.0.27-4vmw..), and P3700 firmware 8DV10171, yet the health check comes back with "Warning".

When will the HCL be updated to not point us to use an obscure version of fw for a Cisco rebranded card?

devros69
Contributor

vSphere is now telling us that it wants the drives to be on the FJP7 firmware. We are currently on 8DV10171, and the Intel tool tells us it would upgrade them to 8DV101H0?!? What firmware are people here having the most success with at the moment? We are on the 1.2.0.27-4vmw driver.

thanks,

-ed

andyarnet
Contributor

We're currently on 8DV10171 with 1.2.0.32-4vmw on 6.5U1. I don't know where FJP7 comes from; it looks like a Fujitsu firmware, but we are using Intel-branded P3700s. Everything is working fine for us and everything is on the HCL. It seems that the health check is never going to work with these drives.

One thing to note, at one point we started seeing slow cloning performance after an update while we were on 6.0U3.  Just found this: https://kb.vmware.com/kb/2149876.  Doesn't seem to affect normal operations, but very disappointing as we have invested heavily in the P3700s as cache drives in all of our hybrid and all-flash clusters.

adarobin
Enthusiast

We are testing some P3700s for vFlash Read Cache and the performance is terrible there as well. I asked support whether the KB you mentioned could also cause issues with vFlash Read Cache, and I was told:

"we do know that the NVMe device in question (I.E. your HP "MO1600KEFHQ"  also known as "1.6TB NVMe PCIe Write Intensive SFF 2.5-in SC2 764892-B21" which is basically a re-branded Intel P3xxx ) is affected by serious performance issues when dealing with continuous writes in the same blocks or block range.  Although this was noticed specifically with vSAN environments, it is reasonable to state that the bad performance "scenario" (it's actually not a bug, I believe, of the device itself but rather a design flaw) of the P3xxx would be exploited with any intense use of the device itself, like for example, vFlash cache in conjunction with write intensive applications like DB servers."

moto316
Enthusiast

We were directed by GSS to disable the firmware version health check. We are running retail Intel P3700s, and the DID, VID, SDID and SVID values are identical for Intel, Fujitsu and at least one other brand.

We were also instructed to disable log compaction, per KB 2149876. If log compaction isn't disabled, you'll notice very high latency values in certain scenarios. (I uncovered this while watching esxtop after a proactive stress test completed: during the deletion process the latency would spike for 15-20 minutes.)

After these two changes our vSAN environments have been performing flawlessly on retail Intel P3700s, with S3520s for capacity.

elerium
Hot Shot

I have a ton of Intel P3700s in production and have only seen the high latency values on them infrequently, but unfortunately at some critical times (during resync from dying hardware or during a host failure). How do you disable log compaction in VSAN? And if it is disabled, can it be safely re-enabled later?
