Expert

FT CPU Spikes and Latency


What kind of latency should be expected when running an FT-protected guest?

I have a new 3-node cluster of DL380 G9 servers running 6.0 with the latest patch from April. Nothing is running on the cluster other than a vCenter guest and one Windows 2012 server I'm using for testing FT. When I enable FT on the guest, it becomes sluggish and the guest CPU spikes when doing anything: opening windows, moving around, etc. Its ping response time when FT-protected is normally 50 ms or greater, and you can see it increase when the guest CPU spikes as a result of opening windows.

Is this normal? I don't think this kind of slowness and latency would be acceptable for production applications, am I missing something?

The guest is Windows Server 2012 with 2 vCPUs and 8 GB of memory.

The vMotion and FT port groups each have 2 dedicated NICs and jumbo frames enabled. When monitoring the network connections, the traffic doesn't look anywhere near saturating the network. Host utilization is near 5% on all nodes.

Cheers! If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".

Accepted Solutions
Expert

After testing with 10GbE NICs for FT and monitoring bandwidth, we determined that 10GbE is required even though the documentation states that it's only recommended. VMware changed the underlying technology that runs FT in vSphere 6.0 from vLockstep to something called "fast checkpointing". Monitoring the FT pNICs with esxtop on an old 5.5 system that uses the old technology (vLockstep), and then taking that same guest and enabling FT on 6.0 while monitoring with esxtop again, we could see that 6.0 uses far more bandwidth to do the same work.
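To see why checkpoint-based FT traffic can outgrow a 1 GbE link, here is a back-of-the-envelope sketch (the dirty-page rate is an assumed illustrative number, not a measurement from this environment): the primary must stream every guest memory page dirtied during a checkpoint interval to the secondary.

```python
# Rough estimate of fast-checkpointing FT logging traffic.
# DIRTY_PAGES_PER_SEC is an assumed figure for a moderately busy VM,
# used only to show the order of magnitude involved.

PAGE_SIZE = 4096                 # bytes per guest memory page
DIRTY_PAGES_PER_SEC = 40_000     # assumed dirty-page rate

logging_bytes_per_sec = DIRTY_PAGES_PER_SEC * PAGE_SIZE
logging_mbits_per_sec = logging_bytes_per_sec * 8 / 1_000_000

print(f"FT logging traffic: ~{logging_mbits_per_sec:.0f} Mbit/s")
# 40,000 pages/s x 4 KiB is already ~1,311 Mbit/s -- past a 1 GbE link
```

Even a modest dirty-page rate lands above 1 Gbit/s, which matches the symptom of the primary being throttled so the secondary can keep up.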

So the real issue was not that something was misconfigured, but that the documentation needs to say that 10GbE is "required" and not "recommended".

Working with VMware, we also discovered that they wouldn't escalate the ticket to engineering unless we were using 10GbE links for FT in 6.0, which would lead you to assume it's a requirement and not a recommendation. The documentation is misleading; hopefully this post helps someone in the future, and hopefully VMware updates the requirements documentation.

Cheers! If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".

View solution in original post

6 Replies
Leadership

10GbE NICs?

Contributor

I have the same servers with the same FT problem; ping even rises to 100-150 ms when it is normally under 1 ms. In legacy mode it works fine, but then I am limited to 1 vCPU.

Expert

Yes, we were originally using 1 GbE NICs because the documentation only said 10GbE was recommended. This was a huge deployment for my client, and the VMware partner who created and approved the design said it should be fine on the 1 GbE links. I was engaged to do the actual build and raised a flag about it, but got overruled by a bunch of partner architects with lots of certs after their names. I think this was an issue of using bleeding-edge software; the architects hadn't actually worked with the new software.

Cheers! If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".
VMware Employee

Hi Bradley,

Can you please point me to where you saw that 10GbE was only recommended, and not required? I want to make sure this gets corrected. As you already mentioned, FT can be really "bursty", exceeding 1 GbE speeds pretty quickly. In some of my tests these spikes occur within fractions of a second, so you'll never capture them using conventional tools like esxtop. When the FT network cannot keep up, it results in latency within the app.
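The burstiness point is worth a quick numeric illustration (the burst duration and rate below are made-up values, chosen only to show the effect): a sub-second spike that saturates the link all but vanishes when throughput is averaged over a typical multi-second sampling interval.

```python
# Why averaged counters can hide FT bursts: simulate 2 s of per-millisecond
# throughput (Gbit/s) containing a single 200 ms burst at 2 Gbit/s.
# All numbers are illustrative assumptions.

samples_ms = [0.0] * 2000        # 2 s of samples, one per millisecond
for i in range(200):             # 200 ms burst at 2 Gbit/s
    samples_ms[i] = 2.0

peak = max(samples_ms)
avg_over_interval = sum(samples_ms) / len(samples_ms)

print(peak, avg_over_interval)   # 2.0 Gbit/s peak vs 0.2 Gbit/s average
```

A monitoring tool reporting the 2-second average would show ~200 Mbit/s, well under 1 GbE, while the link was actually saturated for 200 ms, which is enough to stall the protected guest.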

I double checked just to be sure, and the official documentation does state that a 10GbE dedicated network is required.

vSphere 6.0 Documentation Center - Fault Tolerance

Expert

Agree on the capturing bit. We were just watching packets and transmits on the FT pNICs with esxtop and calculating the average throughput, and we determined that it was maxing out the 1 GbE link. The server was super slow opening windows and had high CPU spikes, a symptom of FT slowing the primary down to allow the secondary to catch up.
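The calculation described above can be sketched like this (the counter values and sampling window are hypothetical, standing in for readings taken from the FT pNIC): take transmit-byte counters at two points in time and convert the delta into an average Mbit/s to compare against a 1 GbE link's ~1000 Mbit/s ceiling.

```python
# Average throughput between two transmit-byte counter readings.
# The counter values below are made up for illustration.

def avg_mbits_per_sec(bytes_start: int, bytes_end: int, seconds: float) -> float:
    """Average throughput over the sampling window, in Mbit/s."""
    return (bytes_end - bytes_start) * 8 / seconds / 1_000_000

# e.g. counters sampled 10 s apart while the protected guest was busy
rate = avg_mbits_per_sec(bytes_start=1_200_000_000,
                         bytes_end=2_400_000_000,
                         seconds=10.0)
print(f"{rate:.0f} Mbit/s")   # ~960 Mbit/s -- effectively saturating 1 GbE
```

An average that close to line rate implies the link is pinned for much of the window, even before accounting for the sub-second bursts discussed above.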

I think the wording in the docs you linked to is a little ambiguous. Instead of saying "Use a dedicated 10-Gbit logging network for FT and verify that the network is low latency.", maybe reword it to state plainly that a dedicated, low-latency 10-Gbit logging network is required for FT.


I say it's ambiguous because further down in the same document is a comparison table that says "Dedicated 10-Gb NIC recommended", found here: vSphere 6.0 Documentation Center - Differences Between Legacy FT and FT. Another issue leading to confusion: if you google 6.0 FT, there are lots of sites stating that 10GbE is recommended. This is a result of early previews of the product in February that everyone blogged about, and we can't do much about those... Hopefully some of them will revisit FT and update their posts after testing.

So the comparison table in the "Differences Between Legacy FT and FT" doc, plus tons of blog posts saying "recommended" that we found while researching the issue, plus my client's VMware partner stating the same thing and pushing 1 GbE in their design, caused me and my client a bit of a headache. We had initially assumed we were configuring something wrong, or hitting some other unforeseen issue, but once we started drilling into the network traffic and measuring it, we realized it just wasn't going to work on a 1 GbE link. Hopefully this thread will serve others who find themselves in this situation.

Cheers! If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".