I am running VSAN 6.1 hybrid with P3700s, and have been since before they were officially supported on the HCL. I have not run into any VSAN issues related to the P3700. I'm using the HHHL form factor and not 2.5" though, so I'm not sure if that makes a difference. Although I generally agree with recommendations to stick firmly to the VSAN HCL, I haven't run into any problems at all with P3700s on any firmware version. With no negative impact, I'm not seeing a reason to downgrade the stock 8DV10171 firmware that these disks ship with. That said, I'm not seeing a reason to upgrade to newer firmware when it's released either, since it probably won't be qualified on the VSAN HCL yet.
I'm also not aware of any way to downgrade the firmware from a higher version to the 8DV10131 that's on the HCL.
We're having the same problem with the HHHL P3700 drives. We're not experiencing any issues other than the Health service saying the drives aren't on the HCL.
I've opened a case with VMware, who told me it was Intel's responsibility to make sure the HCL is correct. I've had a case open with Intel (Case 00288969) for a couple of weeks now with no progress. They've had me upgrade to the latest driver (1.0e-2.0-1OEM.522.214.171.1241871) and upgrade the firmware from 8DV10131 to 8DV10171. No luck.
Maybe it'd be good for you to also open a case with Intel and reference my case so they can see it's affecting more than one person.
Same problem with DC P3600
It is on the HCL but showing up like it isn't in the HCL.
Last firmware, last drivers ...
The LSI 3008 is also on the HCL but is not showing up as being on it.
Last firmware, last drivers ...
Actually with regards to flash devices and drives the statement is that there is a minimum level of firmware which is on the HCL, anything higher is supported as far as I know. I will ask the engineering team to bake this logic in to the health check HCL team.
EDIT: Apparently this does not apply to the Intel P3700 devices, what is listed on the HCL is a hard requirement, so please do not use a higher version!
The driver we are using is 126.96.36.199-4vmw.5188.8.131.521820, and we have had no problems as far as I can tell. We started off with 1.0e.1.1-1OEM.5184.108.40.2061871, which is the driver listed on the HCL, and had all sorts of problems. It is my understanding that Intel is in the process of recertifying the P3700/P3600 with the updated firmware.
I've been told by VMware support that they will support these drives with this driver/firmware combination, so we have moved the cluster into production. I would love to see the HCL warning go away soon though.
Out of curiosity, what kind of problems were you seeing before updating the driver?
Also what version of VSAN?
We were seeing congestion errors on the SSDs whenever we ran stress tests for more than a couple of minutes, along with high latency and poor performance in general. Intel told us it was due to the driver not matching the 8DV10171 firmware.
We're running 6.2, and performance is looking really good at this point.
I am actually seeing the same performance/congestion related issues in my 6.2 lab, specifically with write performance, while everything is working perfectly in 6.1. When I disabled the new 6.2 checksum feature in the storage policies the problem went away, but I'd rather have that option enabled on my clusters.
I'll give the driver update a try! Thanks for sharing.
A quick question ...
How do I get the firmware version of the P3600 800GB SSD?
We too see a lot of latency, sometimes 350ms or more.
We are using driver version: 1.0e.0.35-1vmw.
But if you would like to go to version: 220.127.116.11-4vmw
you have to be on firmware version: 8DV10171
So I would like to check the firmware version of the SSD so I can upgrade that first.
Here are the warnings I get from VMware, although the devices are on the HCL:
Device | Driver in use | Driver health
vmhba2: Intel Corporation DC P3600 SSD [2.5" SFF] | nvme (1.0e.0.35-1vmw.600.2.34.3620759) | Warning
vmhba3: LSI LSI Logic Fusion-MPT 12GSAS SAS3008 PCI-Express | lsi_msgpt3 (06.255.12.00-8vmw.600.1.17.3029758) | Warning
vmhba2: Intel Corporation DC P3600 SSD [2.5" SFF] | nvme (1.0e.0.35-1vmw.600.2.34.3620759) | Warning
vmhba3: Avago (LSI Logic) / Symbios Logic Avago (LSI)3008 | lsi_msgpt3 (12.00.00.00-1OEM.600.0.0.2768847) | Warning
vmhba2: Intel Corporation DC P3600 SSD [2.5" SFF] | nvme (1.0e.0.35-1vmw.600.2.34.3620759) | Warning
vmhba3: LSI LSI Logic Fusion-MPT 12GSAS SAS3008 PCI-Express | lsi_msgpt3 (06.255.12.00-8vmw.600.1.17.3029758) | Warning
Thanks in advance
You can install the SSD Data Center Tool VIB and use it to find the firmware version. The easiest way, though, would be to pull the drive: the FW version is printed on the drive itself (at least it is on our P3700s).
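If you go the tool route, a minimal sketch of checking the result might look like the following. It assumes you have captured the tool's text output (for example from `isdct show -intelssd`; check the tool's documentation for the exact install path on ESXi) and that the output contains `Firmware : <version>` lines. The sample text and the helper names here are illustrative assumptions, not real output from a drive.

```python
import re

# Firmware the thread says is required before moving to the newer nvme driver.
REQUIRED_FIRMWARE = "8DV10171"

# Matches lines such as "Firmware : 8DV10131", but not
# "FirmwareUpdateAvailable : ..." (the colon must follow "Firmware" directly).
FIRMWARE_LINE = re.compile(r"^\s*Firmware\s*:\s*(\S+)\s*$", re.MULTILINE)


def firmware_versions(tool_output: str) -> list[str]:
    """Return every firmware version reported in the captured tool output."""
    return FIRMWARE_LINE.findall(tool_output)


def needs_firmware_update(tool_output: str, required: str = REQUIRED_FIRMWARE) -> bool:
    """True if any reported drive is on a firmware other than `required`."""
    return any(version != required for version in firmware_versions(tool_output))


# Hypothetical, shortened sample: the layout is an assumption for illustration.
SAMPLE_OUTPUT = """\
- Intel SSD DC P3600 Series (hypothetical serial) -
Firmware : 8DV10131
FirmwareUpdateAvailable : (placeholder)
Index : 0
"""

if __name__ == "__main__":
    print(firmware_versions(SAMPLE_OUTPUT))
    print(needs_firmware_update(SAMPLE_OUTPUT))
```

This only reads text you paste or pipe in, so it can run on any workstation rather than on the host itself.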
The 18.104.22.168-4vmw.522.214.171.1241820 driver did significantly reduce congestion and improve latency in general over the intel-nvme drivers. I'm still seeing an issue where sequential writes from VM guests are limited to no more than 250MB/s, but only with checksum enabled (with it disabled I get 800MB/s+ write speed). It may be the RAID controller or its driver, as I'm using a Dell H730, which isn't on the HCL for 6.2 yet. The latest I heard from support is that Dell/VMware may have my RAID controller added to the 6.2 HCL by the end of May.
Has there been any traction on this? I'm also hitting up Intel on their end (Firmware Downgrade |Intel Communities) with the same issue to see if we can push this along. According to Intel, the certification validation lies with VMware at this point. From what a VMware Federal Escalation Engineer told me during a call for an unrelated service request, VMware can either certify in-house OR request results from the hardware company to analyze for certification.
I have a mix of VSAN 6.1/6.2 hybrid and all-flash clusters, all using P3700 or P3600 for cache. I can tell you that in VSAN 6.2, with the HCL firmware/driver combo, you will see very poor performance, congestion and latency problems. I recently built a new 6.2 VSAN all-flash cluster that happened to ship with the 8DV10131 (HCL) firmware. With the 1.0e.1.1-1OEM.5126.96.36.1991871 HCL driver, there are severe write performance issues. The result is the same after upgrading to firmware 8DV10171. Between VMware and Intel, or whoever is responsible for updating the HCL, I don't think any real testing went into it before it got 6.2 qualified. Testing the HCL combo for even 5 minutes, one would immediately notice a major latency/congestion issue, even under light stress testing. I believe it also has something to do with the new checksum functionality added in 6.2: if it is disabled in the storage policy, performance returns to normal levels.
Personally I don't think the answer is downgrading firmware, but for Intel to release a new intel-nvme driver (and/or firmware update) that resolves the issues discovered on 6.2. Also, none of the issues exist on 6.1 (probably because the checksum feature isn't in 6.1).
Here are my findings from 6.2 AF VSAN using Intel P3700 400GB for write cache and 4x Intel S3510 800GB for capacity:
P3700 400GB, firmware 8DV10131, intel-nvme 1.0e.1.1-1OEM.5188.8.131.521871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum
P3700 400GB, firmware 8DV10131, intel-nvme 1.0e.2.0-1OEM.5184.108.40.2061871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum
P3700 400GB, firmware 8DV10171, intel-nvme 1.0e.1.1-1OEM.5220.127.116.111871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum
P3700 400GB, firmware 8DV10171, intel-nvme 1.0e.2.0-1OEM.518.104.22.1681871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum
P3700 400GB, firmware 8DV10171, nvme 22.214.171.124-4vmw.5126.96.36.1991820 driver - no latency/congestion problems, sequential writes limited to 250MB/s, no issues if disabling checksum
I posted the same response to the Intel forums. I'll probably open a case with Intel in the next day or two, which will hopefully give this more visibility at Intel as well.