VMware Cloud Community
andyarnet
Contributor

Intel DC P3700 Firmware

We are attempting to build a new 4-node VSAN hybrid cluster using Intel P3700 SSDs. After getting everything together, the health checks fail, saying the driver is not on the HCL, even though we are using the correct driver. We opened an SR, and support is telling me that the problem is actually the firmware on the SSDs. The drives shipped with FW0171, but FW0131 is required. We are trying to get a response from Intel to see if the firmware can be downgraded, but so far, no luck.

As far as I can tell, the Intel P3700 and P3600 are the only 2.5" NVMe drives on the HCL, and unless you have old drives, they can't be used.  Has anyone had any experience with this drive in VSAN? 

34 Replies
elerium
Hot Shot

I am running VSAN 6.1 hybrid with P3700s and have been since before they were officially supported on the HCL. I have not run into any VSAN issues related to the P3700. I'm using the HHHL form factor and not 2.5", though, so I'm not sure if that makes a difference. Although I generally agree with recommendations to stick firmly to the VSAN HCL, I haven't run into any problems at all with P3700s on any firmware version. With no negative impact, I'm not seeing a reason to downgrade the stock 8DV10171 firmware that is shipping with these disks. That said, I'm not seeing a reason to upgrade to newer firmware when it's released either, since it probably won't be qualified on the VSAN HCL yet.

I'm also not aware of any way to downgrade the firmware from a higher version to the 8DV10131 that's on the HCL.

boomboom21
Contributor

We're having the same problem with the HHHL P3700 drives.  We're not experiencing any issues other than the Health service saying the drives aren't on the HCL.

I've opened a case with VMware, who told me it was Intel's responsibility to make sure the HCL is correct. I've had a case open with Intel (Case 00288969) for a couple of weeks now with no progress. They've had me upgrade to the latest driver (1.0e-2.0-1OEM.550.0.0.1391871) and upgrade the firmware from 8DV10131 to 8DV10171. No luck.

Maybe it'd be good for you to also open a case with Intel and reference my case so they can see it's affecting more than one person.

KurtDePauw1
Enthusiast

Same problem with the DC P3600.

It is on the HCL but shows up as if it isn't.

Latest firmware, latest drivers ...

The LSI 3008 is also on the HCL but does not show up as being on it.

Latest firmware, latest drivers ...

depping
Leadership

Actually, with regard to flash devices and drives, the statement is that the HCL lists a minimum firmware level and anything higher is supported, as far as I know. I will ask the engineering team to bake this logic into the HCL health check.

EDIT: Apparently this does not apply to the Intel P3700 devices, what is listed on the HCL is a hard requirement, so please do not use a higher version!
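
The difference between the two policies described above (minimum level versus hard requirement) can be sketched in a few lines of Python. This is only an illustration, not how the health check is implemented. The function name `firmware_acceptable` and the `exact_required` flag are hypothetical, and the sketch assumes that firmware strings of the same family and length (such as 8DV10131 and 8DV10171) order correctly as plain strings.

```python
def firmware_acceptable(installed, listed, exact_required):
    """Return True if the installed firmware satisfies the HCL-listed one.

    exact_required=False: the listed version is treated as a minimum.
    exact_required=True:  the listed version is a hard requirement.
    Assumption: versions of the same family compare correctly as strings.
    """
    if exact_required:
        return installed == listed
    return installed >= listed


# Firmware versions taken from this thread.
print(firmware_acceptable("8DV10171", "8DV10131", exact_required=False))
print(firmware_acceptable("8DV10171", "8DV10131", exact_required=True))
```

Under the minimum-level reading the shipped 8DV10171 would pass; under the hard-requirement reading stated in the EDIT it would not.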

andyarnet
Contributor

The driver we are using is 1.2.0.27-4vmw.550.0.0.1331820, and we have had no problems as far as I can tell. We started off with 1.0e.1.1-1OEM.550.0.0.1391871, which is the driver listed on the HCL, and had all sorts of problems. It is my understanding that Intel is in the process of recertifying the P3700/P3600 with the updated firmware.

I've been told by VMware support that they will support these drives with this driver/firmware combination, so we have moved the cluster into production. I would love to see the HCL warning go away soon, though :)

elerium
Hot Shot

Out of curiosity, what kind of problems were you seeing before updating the driver?

Also what version of VSAN?

andyarnet
Contributor

We were seeing congestion errors on the SSDs while running stress tests for any more than a couple minutes.  High latency and just crappy performance in general.  Intel told us it was due to the driver not matching the 8DV10171 firmware.

We're running 6.2, and performance is looking really good at this point.

elerium
Hot Shot

I am actually seeing the same performance/congestion issues in my 6.2 lab, specifically with write performance, while everything works perfectly in 6.1. When I disabled the new 6.2 checksum feature in the storage policies the problem went away, but I'd rather have that option enabled on my clusters.

I'll give the driver update a try!  Thanks for sharing.

KurtDePauw1
Enthusiast

A quick question ...

How do I get the firmware version of the P3600 800GB SSD?

We too see a lot of latency, sometimes 350ms or more.

We are using driver version 1.0e.0.35-1vmw.

But to go to version 1.2.0.27-4vmw, you have to be on firmware version 8DV10171.

So I would like to check the firmware version of the SSD so I can upgrade that first.

Here are the warnings I get from VMware, although the devices are on the HCL:

   

Device | Driver in use | Driver health
vmhba2: Intel Corporation DC P3600 SSD [2.5" SFF] | nvme (1.0e.0.35-1vmw.600.2.34.3620759) | Warning
vmhba3: LSI LSI Logic Fusion-MPT 12GSAS SAS3008 PCI-Express | lsi_msgpt3 (06.255.12.00-8vmw.600.1.17.3029758) | Warning
vmhba2: Intel Corporation DC P3600 SSD [2.5" SFF] | nvme (1.0e.0.35-1vmw.600.2.34.3620759) | Warning
vmhba3: Avago (LSI Logic) / Symbios Logic Avago (LSI)3008 | lsi_msgpt3 (12.00.00.00-1OEM.600.0.0.2768847) | Warning
vmhba2: Intel Corporation DC P3600 SSD [2.5" SFF] | nvme (1.0e.0.35-1vmw.600.2.34.3620759) | Warning
vmhba3: LSI LSI Logic Fusion-MPT 12GSAS SAS3008 PCI-Express | lsi_msgpt3 (06.255.12.00-8vmw.600.1.17.3029758) | Warning

Thanks in advance

andyarnet
Contributor

You can install the SSD Data Center Tool VIB and use it to find the firmware version. The easiest way, though, would be to pull the drive: the FW version is printed on the drive (at least it is on our P3700s).

elerium
Hot Shot

The 1.2.0.27-4vmw.550.0.0.1331820 driver did significantly reduce congestion and improve latency in general over the intel-nvme drivers. I'm still seeing an issue where sequential writes from VM guests are limited to no more than 250MB/s, but only with checksum enabled (with it disabled I get 800MB/s+ write speed). It may be the RAID controller or RAID driver, as I'm using a Dell H730, which isn't on the HCL for 6.2 yet. The latest I heard from support is that Dell/VMware may have my RAID controller added to the 6.2 HCL by the end of May.



amurrellwsu
Contributor

Has there been any traction on this? I'm also hitting up Intel on their end (Firmware Downgrade | Intel Communities) with the same issue to see if we can push this along. According to Intel, the certification validation lies with VMware at this point. From what a VMware Federal Escalation Engineer told me during a call for an unrelated service request, VMware can either certify in-house or request results from the hardware company to analyze for certification.

elerium
Hot Shot

I have a mix of VSAN 6.1/6.2 hybrid and all-flash clusters, all using P3700 or P3600 for cache. I can tell you that in VSAN 6.2, using the HCL firmware/driver combo, you will see very poor performance along with congestion and latency problems. I recently built a new 6.2 VSAN all-flash cluster that happened to ship with the 8DV10131 (HCL) firmware. Using the 1.0e.1.1-1OEM.550.0.0.1391871 HCL driver, there are severe write performance issues. The result is the same after upgrading to firmware 8DV10171. Between VMware and Intel, or whoever is responsible for updating the HCL, I don't think any real testing went into this combo before it got qualified for 6.2. Testing the HCL combo for even 5 minutes, one would immediately notice a major latency/congestion issue, even under light stress testing. I believe it also has something to do with the new checksum functionality added in 6.2: if checksum is disabled in the storage policy, performance returns to normal levels.

Personally, I don't think it's a matter of downgrading firmware but of Intel releasing a new intel-nvme driver (and/or firmware update) that resolves the issues discovered with 6.2. Also, none of the issues exist on 6.1 (probably because the checksum feature isn't in 6.1).

Here are my findings from 6.2 AF VSAN using Intel P3700 400GB for write cache and 4x Intel S3510 800GB for capacity:

P3700 400GB, firmware 8DV10131, intel-nvme 1.0e.1.1-1OEM.550.0.0.1391871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum

P3700 400GB, firmware 8DV10131, intel-nvme 1.0e.2.0-1OEM.550.0.0.1391871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum

P3700 400GB, firmware 8DV10171, intel-nvme 1.0e.1.1-1OEM.550.0.0.1391871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum

P3700 400GB, firmware 8DV10171, intel-nvme 1.0e.2.0-1OEM.550.0.0.1391871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum

P3700 400GB, firmware 8DV10171, nvme 1.2.0.27-4vmw.550.0.0.1331820 driver - no latency/congestion problems, sequential writes limited to 250MB/s, no issues if disabling checksum

elerium
Hot Shot

I posted the same response to the Intel forums. I'll probably open a case with Intel in the next day or two, which will hopefully give this more visibility at Intel as well.

depping
Leadership

Are you running benchmarks or is this during normal operations? Also, have you opened up a VMware support ticket for this issue? If so, what is the SR number?

amurrellwsu
Contributor

Elerium,

Can you confirm that your S3510 SSDs are not contributing to the problem by swapping them for something else? My cluster consists of 4 x Dell PowerEdge R730 servers each with 2 disk groups running VSAN 6.1 hybrid. On three of the nodes, the disk groups consist of an Intel S3700 800GB for write cache and 6 x Seagate Constellation.2 (ST91000640SS) 1TB hard drives for capacity. The remaining node has Intel P3700 800GB PCIe drives instead of using the Intel S3700. I'm currently running 8DV10171 with nvme 1.2.0.27-4vmw.550.0.0.1331820, so I don't believe I'm seeing any of the issues you've described.

elerium
Hot Shot

Thanks, Duncan, for checking my SR. I received your PM and don't have other recommendations yet, so I will wait and see if support has additional suggestions. The congestion/latency issues I describe occur during normal operations. When running a benchmark, the issues appear quickly, within 5-10 minutes.

elerium
Hot Shot

I am sure the S3510s are not the issue: I have the same thing happening on my hybrid cluster, which uses WD RE4s as capacity drives.

You mention you are on VSAN 6.1. The issue I'm describing occurs only with VSAN 6.2 (probably related to the checksum feature). In VSAN 6.1 or VSAN 6.0 I didn't experience any issues with the P3700/P3600 on any cluster.

eode
Enthusiast

May I ask which model/type of the Intel DC P3700 you guys are running? While searching through the VSAN HCL DB the "cool way" (the JSON file directly: http://partnerweb.vmware.com/service/vsan/all.json), I found that the SSDPE2MD800G4 (800GB, 2.5-inch) is listed with SSID 3703, and the SSDPEDMD800G4 (800GB, HHHL AIC) also has SSID 3703. Our DC P3700, 800GB, HHHL AIC has an SSID of 3702, not 3703, which is probably why the Health Check gives us a "Warning": the device does not match any SSID on the list. A different driver or firmware will in that case give the same result, as the device still doesn't match any SSID.

Regarding identical SSID in the HCL

A quick search in the JSON-file, and you'll find the following relevant IDs (output from today - this may change):

"id": 39653,
"model": "Intel SSD DC P3700 Series SSDPE2MD800G4 (800 GB, 2.5-inch)",
"vid": "8086",
"did": "0953",
"svid": "8086",
"ssid": "3703",


"id": 39659,
"model": "Intel SSD DC P3700 Series SSDPEDMD800G4 (800 GB, HHHL AIC)",
"vid": "8086",
"did": "0953",
"svid": "8086",
"ssid": "3703",
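
A minimal Python sketch of that kind of search, for anyone who wants to repeat it. It assumes the records have already been pulled out of the JSON file into a flat list of dicts carrying the fields shown above. The actual nesting of all.json is not shown in this post, so the loading step is left out, and `group_by_pci_id` is a hypothetical helper name.

```python
from collections import defaultdict

# The two records quoted above, reduced to the fields shown in the excerpt.
records = [
    {"id": 39653,
     "model": "Intel SSD DC P3700 Series SSDPE2MD800G4 (800 GB, 2.5-inch)",
     "vid": "8086", "did": "0953", "svid": "8086", "ssid": "3703"},
    {"id": 39659,
     "model": "Intel SSD DC P3700 Series SSDPEDMD800G4 (800 GB, HHHL AIC)",
     "vid": "8086", "did": "0953", "svid": "8086", "ssid": "3703"},
]


def group_by_pci_id(entries):
    """Group HCL record ids by their (vid, did, svid, ssid) tuple."""
    groups = defaultdict(list)
    for entry in entries:
        key = (entry["vid"], entry["did"], entry["svid"], entry["ssid"])
        groups[key].append(entry["id"])
    return dict(groups)


groups = group_by_pci_id(records)
for key, ids in groups.items():
    if len(ids) > 1:
        # More than one HCL record shares the same four PCI IDs.
        print("shared PCI IDs", ":".join(key), "->", ids)
```

Any tuple that maps to more than one record id is a case where two models are listed under identical PCI IDs, as with the 2.5-inch and HHHL entries here.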

Checking Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID) and Sub-Device ID (SSID)

vmkchdev -l |grep vmhba4

0000:84:00.0 8086:0953 8086:3702 vmkernel vmhba4
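
For comparing that output against the HCL IDs programmatically, here is a rough Python sketch. The field layout (PCI address, VID:DID, SVID:SSID, owner, alias) is assumed from the single line quoted above, and `parse_vmkchdev_line` is a hypothetical helper, not part of any VMware tool.

```python
def parse_vmkchdev_line(line):
    """Split one 'vmkchdev -l' output line into its PCI IDs and alias.

    Layout assumed from the one line quoted in this post:
    <pci address> <vid>:<did> <svid>:<ssid> <owner> <alias>
    """
    address, vid_did, svid_ssid, owner, alias = line.split()[:5]
    vid, did = vid_did.split(":")
    svid, ssid = svid_ssid.split(":")
    return {"address": address, "vid": vid, "did": did,
            "svid": svid, "ssid": ssid, "owner": owner, "alias": alias}


device = parse_vmkchdev_line("0000:84:00.0 8086:0953 8086:3702 vmkernel vmhba4")

# IDs of the HCL entries quoted earlier in this post.
hcl_ids = ("8086", "0953", "8086", "3703")
device_ids = (device["vid"], device["did"], device["svid"], device["ssid"])

matches = device_ids == hcl_ids
print(device["alias"], "matches HCL entry:", matches)
```

With the values from this post the comparison fails only on the last field (3702 versus 3703), which is the mismatch described above.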

Regarding driver & firmware-versions

Our NVMe-device is also shipped with FW 8DV10171 (verified with the Intel DCT).

Based on our VID, DID, SVID and SSID for the device, the following HCL entries are available:

From the "General HCL"-list (cat=io): FW 8DV10171 & nvme version 1.2.0.27-4vmw (VMware Async)

From the "VSAN HCL" (cat=ssd): FW 8DV10131 & nvme 1.0e.1.1-1OEM.550.0.0.1391871 (which actually is "intel-nvme", as this is only available as Partner Async-driver, as far as I know).

Still waiting for a response on our SR - just wanted to let you know our findings (in case it helps).

Best regards,

Espen Ødegaard
