VMware Cloud Community
riotsoho
Contributor

Intel P3700 Drivers for VSAN?

I worked with our VMware team to design a VSAN build-out that includes Intel P3700 PCIe cards as the flash tier, backed by Seagate 1.2TB 10k SAS drives for storage.  Unfortunately, the P3700s still haven't gone onto the HCL, and I'm currently "chasing my tail" with latency and outstanding IO.  I opened an SR, but the recommended drivers for the H730 didn't appear to change anything, and I'm fairly certain that's not the issue (destaging doesn't seem to be a problem).  I understand that the P3700 hasn't passed the HCL yet, but it's still in the process, and many white papers were written with this card as part of the architecture.

This is all running VMware 6.0b (VMware ESXi 6.0.0 build-2494585 on hosts).

Hardware:

-Dell R630s (30 of them)

-2x Intel P3700 400GB flash cards (intel-nvme-1.0e.1.1-1OEM.550.0.0.1391871.x86_64.vib driver)

-10x 1.2TB Seagate 10k SAS drives

-Dell H730 RAID controller in HBA mode with cache disabled, etc. (firmware 25.3.0.0016, driver 6.606.12.00). The drivers for this were a worry, but since I assume that first writes go straight to the PCIe cards, I no longer think this is the issue.

-Intel X710 with 2x 10Gb DAC connections

-4x Juniper QFX5100 10Gb switches (15 R630s per 2 switches, scaling up to 30 servers per 2 switches eventually)


SAN traffic goes over uplink 2 (with uplink 1 as standby); everything else goes over uplink 1 (with uplink 2 as standby).  I'm working on a plan to move to LACP, but it's still in the works.


The current workload is very low.  There are only about 6 production VMs running on this cluster for baselining and kicking the tires, with the idea that 500-1000 VMs will be spun up over the next few months.  For the majority of VMs, performance is not crucial, but some of the issues I am currently seeing are a bit of a showstopper.


Problems:

If I run "bonnie++ -u root" on a single VM, I can see latency go up to 65,000ms (yes, really 65k ms) and the VM basically becomes unresponsive (100% iowait, and it is very rarely able to write IO because of the huge latency).  The write buffer never gets very full during this period (it stays stuck at 30%), and destaging doesn't even start during the run.  Similar issues happen if I run ATTO Disk Benchmark on a Windows system with a high disk queue depth (4 appears to be fine, 10 kills the VM).


I can get very high write speeds (500-800MB/s or more), but as soon as the latency jumps over a few hundred ms, it's all downhill.


Even with a fairly simple logging VM with all our hosts pointed at it, I get occasional latency spikes (1400ms+, with an average of 15ms, which itself seems high).  This box just runs a lot of writes to logstash and an elasticsearch index, with occasional reads when kibana is showing someone something.
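
For anyone wanting to put numbers on spikes like these from inside a guest, here is a minimal Python sketch of a write-latency probe. It is not what was run in this thread (that was bonnie++ and ATTO); the function names, block size, and write count are all assumptions for illustration. It writes small blocks, fsyncs each one, and reports the average, 99th percentile, and maximum latency so that a low average hiding rare large spikes becomes visible.

```python
import math
import os
import statistics
import tempfile
import time


def probe_write_latency(path, block_size=4096, count=100):
    """Write `count` blocks to `path`, fsync after each, return latencies in ms."""
    block = b"\0" * block_size
    latencies_ms = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # push the write through the guest's page cache
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies_ms


def summarize(latencies_ms):
    """Return average, nearest-rank 99th percentile, and max latency in ms."""
    ordered = sorted(latencies_ms)
    p99_index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return {
        "avg_ms": statistics.fmean(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Hypothetical target: a scratch file in a temp directory on the disk under test.
    with tempfile.TemporaryDirectory() as tmp:
        print(summarize(probe_write_latency(os.path.join(tmp, "probe.bin"))))
```

This only measures latency as seen by one guest at queue depth 1, so it will not reproduce the high-queue-depth collapse described above; it is mainly useful for comparing before and after a driver or firmware change.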

Are there any special drivers that I can get from somewhere for the Intel P3700's?  Anything else I should really look into?  I'm tired of chasing my tail and want to start migrating actual load to this new cluster.  I've tried RAID0 on a smaller cluster of 4 boxes, but that wasn't any better, and is way more annoying.

Accepted Solutions
elerium
Hot Shot

I'm using P3700 1.6 & 2.0 TB cards in my VSAN without issue. I have a fairly similar setup, except using R730xd servers, with the same RAID card and the same firmware and drivers, also on 6.0b. When I originally set it up, I was seeing high latency spikes of 200-400+ms; for me, it was fixed by updating NIC drivers. I'm using a different NIC (Intel X540-AT2), but updating the firmware and driver for my NIC brought my latency to ~3ms average with an occasional peak blip at ~15ms. It's probably worth a shot to update any firmware on your X710 and use the corresponding driver listed on the VMware HCL (VMware Compatibility Guide: I/O Device Search).

I have also seen really bad latency issues come from network misconfiguration. In our case we had a 1Gb failover link in case our 10Gb link failed, but the links got set to load balanced instead, and performance and latency were very poor until we noticed that the 1Gb link was being fully utilized.
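
The failure mode described above (a slower link ending up active alongside a faster one instead of sitting in standby) is easy to check for once the teaming configuration has been written down. The following Python sketch is an illustration only: the `Uplink` record, its fields, and the vmnic names are hypothetical, and nothing here queries ESXi. You would fill in the list by hand from whatever your vSwitch or port group teaming settings show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uplink:
    name: str        # e.g. "vmnic0" (hypothetical label)
    speed_mbps: int  # negotiated link speed
    role: str        # "active" or "standby"


def slow_active_uplinks(uplinks):
    """Return names of active uplinks slower than the fastest active uplink."""
    active = [u for u in uplinks if u.role == "active"]
    if not active:
        return []
    fastest = max(u.speed_mbps for u in active)
    return [u.name for u in active if u.speed_mbps < fastest]


if __name__ == "__main__":
    # Made-up example: the 1Gb link was meant to be standby but is active.
    team = [
        Uplink("vmnic0", 10000, "active"),
        Uplink("vmnic1", 1000, "active"),
    ]
    print(slow_active_uplinks(team))
```

An empty result means every active uplink runs at the same speed; any name returned is a link that is sharing traffic it was probably only meant to carry during a failover.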

You can also try the inbox NVMe driver; I was running on that without issue before Intel released the 1.1 driver that you're using now. From my limited testing, the Intel drivers are slightly faster, but nothing majorly different.

7 Replies
zdickinson
Expert

Intel P3700 for VSAN

That very long thread comes to the same conclusion as it seems you have.  There is a driver issue and it's being worked on by VMware and Intel, but there is no fix.  I might give the new 6.1 a try.  Thank you, Zach.

riotsoho
Contributor

Yeah, these things are killing me.  We're getting better IO latency from some fairly standard 300GB SSDs on another cluster running VSAN in RAID0 mode.  I want these cards to work so badly, but the drivers just aren't there yet.  I'm keeping my fingers crossed for some better drivers soon.

wreedctd
Enthusiast

Subscribing.

riotsoho
Contributor

After upgrading the Intel X710 drivers to the latest (i40e_1.3.38-1oem.550.0.0.1331820), I've been having much better results.  I haven't been able to kill a VM with bonnie++ so far.
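
With 30 hosts, it can help to confirm that every host actually ended up on the new driver. Below is a small Python sketch that pulls the driver name and version out of a string shaped like the one above and compares two of them. The parsing pattern is an assumption based only on that single string (`name_version-rest`), not on any documented naming scheme, and the "installed" value in the example is made up.

```python
import re

# Assumed shape, inferred from "i40e_1.3.38-1oem.550.0.0.1331820":
# <name>_<dotted numeric version>-<everything else>
_DRIVER_RE = re.compile(r"^(?P<name>[A-Za-z0-9]+)_(?P<version>\d+(?:\.\d+)*)-")


def parse_driver(vib_string):
    """Return (driver_name, version_tuple) parsed from a driver string."""
    match = _DRIVER_RE.match(vib_string)
    if match is None:
        raise ValueError(f"unrecognized driver string: {vib_string!r}")
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("name"), version


def is_older(installed, target):
    """True if `installed` names the same driver as `target` at a lower version."""
    installed_name, installed_version = parse_driver(installed)
    target_name, target_version = parse_driver(target)
    if installed_name != target_name:
        raise ValueError("driver names differ; refusing to compare")
    return installed_version < target_version


if __name__ == "__main__":
    target = "i40e_1.3.38-1oem.550.0.0.1331820"  # from the post above
    installed = "i40e_1.2.0-1oem.000"            # hypothetical older host
    print(parse_driver(target), is_older(installed, target))
```

Comparing versions as integer tuples avoids the string-comparison trap where "1.10" sorts before "1.9".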

JohnNicholson
Enthusiast

Stay tuned here.  We should hopefully have something soon. (And we will get a new category on the VSAN VSG for NVMe.)

arielsanchezmo1
Enthusiast

I just received this from my Intel rep

https://communities.intel.com/community/itpeernetwork/blog/2015/12/01/intel-is-the-first-to-official...

They are now officially supported for VSAN.