frivers
Contributor

Every Time the AV Server Pushes Out Updates, VM Network Connectivity Gets Spotty

We're currently running two ESXi 5.1 hosts, both of which use teamed 1 Gb connections for the VM network.  We have about 60 VMs across the two servers, and every time the Vipre AV server pushes out updates to almost all VMs at once, the network connections (file, HTTP, etc.) slow down and sometimes drop off altogether for seconds at a time.  I thought the links were being saturated and tried to verify this in the Performance tab in vCenter.  However, what I saw there perplexed me: I'm not even saturating half of my connection.  The max is probably about 400 Mbps.  Last week I noticed that one of the vmnics on one of the hosts was running in 100 Mbps mode, and I assumed this was causing the issue.  I simply unplugged the link and plugged it back in, and the vmnic reconnected at 1 Gbps and has stayed there ever since.  However, I'm still getting slow VM performance during the definition updates.

This doesn't make sense to me.  I would think I needed to tune the AV update settings, but I'm not saturating my NICs at all.  What could be the issue?
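
For reference, the utilization arithmetic behind "not even half" can be sketched as below. This is only a rough check, assuming the observed peak is 400 Mbps (megabits per second) and that the figure applies to a single host; the function name `utilization` is made up for illustration. Depending on the teaming policy, one VM's traffic may be pinned to a single uplink, so the per-link case is worth looking at alongside the whole team.

```python
def utilization(observed_mbps: float, link_mbps: float, links: int = 1) -> float:
    """Return the fraction of nominal link capacity in use."""
    return observed_mbps / (link_mbps * links)


# Assumed peak from the vCenter Performance tab, in Mbps.
peak_mbps = 400

# Traffic spread evenly over both 1 Gbps uplinks in the team.
team = utilization(peak_mbps, 1000, links=2)

# Worst case: all of the traffic lands on one 1 Gbps uplink.
single = utilization(peak_mbps, 1000)

# The earlier fault: the same traffic on an uplink negotiated at 100 Mbps.
degraded = utilization(peak_mbps, 100)

print(f"whole team:      {team:.0%}")
print(f"one 1 Gbps link: {single:.0%}")
print(f"100 Mbps link:   {degraded:.0%}")
```

On these assumptions the team is well under half used, and only the degraded 100 Mbps uplink would be oversubscribed, which fits the slowdown persisting after the link was fixed.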

0 Kudos
19 Replies
vmroyale
Immortal

Have you checked other resource usage to see if there are problems there? With this "storm" it could be a number of things happening to your resources.

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
0 Kudos

Have you checked with your AV vendor to see if they have any recommendations for pushing updates to VMs?  Did you look at the CPU/memory usage of the hosts/VMs when this storm occurs?

Ben Liebowitz, VCP vExpert 2015, 2016, & 2017 If you found my post helpful, please mark it as helpful or answered to award points.
0 Kudos
frivers
Contributor

vmroyale wrote:

Have you checked other resource usage to see if there are problems there? With this "storm" it could be a number of things happening to your resources.

Nothing notable with the other stats.  Memory looks normal and so does CPU.  I'll look more into disk usage.  However, the reason I focused on the networking is the vmnic issue from the previous week.  I was having the same issues and noticed that one of the vmnics was running at 100 Mbps.  I thought fixing that would resolve the problem, but it has not.

0 Kudos
frivers
Contributor

BenLiebowitz wrote:

Have you checked with your AV vendor to see if they have any recommendations for pushing updates to VMs?  Did you look at the CPU/memory usage of the hosts/VMs when this storm occurs?

This is the first thing I tried.  The AV never presented a problem before, but I made some tweaks anyway.  The AV server sends updates out in 60k packets every 100 ms.  I changed the interval for most servers to 200 ms.  This morning I still had the issue.  Yesterday I didn't.
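
As a rough sanity check on those throttle settings, the sketch below estimates the push rate they imply. It assumes "60k" means 60,000 bytes sent to each client per interval and that about 60 clients update at once (the VM count from the first post); both are assumptions, and the helper `push_rate_mbps` is hypothetical rather than anything from the AV product.

```python
def push_rate_mbps(chunk_bytes: int, interval_ms: float, clients: int = 1) -> float:
    """Estimate the aggregate push rate in megabits per second."""
    chunks_per_second = 1000 / interval_ms
    bits_per_second = chunk_bytes * 8 * chunks_per_second * clients
    return bits_per_second / 1_000_000


CHUNK_BYTES = 60_000  # assumed meaning of "60k"
CLIENTS = 60          # assumed number of VMs updating simultaneously

for interval_ms in (100, 200):
    per_client = push_rate_mbps(CHUNK_BYTES, interval_ms)
    total = push_rate_mbps(CHUNK_BYTES, interval_ms, clients=CLIENTS)
    print(f"{interval_ms} ms interval: {per_client:.1f} Mbps per client, "
          f"{total:.0f} Mbps for {CLIENTS} clients")
```

Under these assumptions, doubling the interval halves the push rate, and even the faster setting stays below a single 1 Gbps uplink, so the throttle alone is unlikely to explain dropped connections.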

0 Kudos

What about CPU/memory utilization on the hosts when the updates are pushed out?

Can you (or have your network team) check the ports on the physical network switch for errors?

Ben Liebowitz, VCP vExpert 2015, 2016, & 2017 If you found my post helpful, please mark it as helpful or answered to award points.
0 Kudos
frivers
Contributor

In addition, this really started happening when I updated VMware Tools on a large batch of VMs that had never been updated after the upgrade from ESX 4 to ESXi 5.1.

0 Kudos

Have you also upgraded the virtual hardware?  What about upgrading the datastore(s) to VMFS5?

Ben Liebowitz, VCP vExpert 2015, 2016, & 2017 If you found my post helpful, please mark it as helpful or answered to award points.
0 Kudos
frivers
Contributor

Ben,

I haven't done either of those things.  Thanks for the feedback, guys.  Trying to make sense of this intermittent issue.

0 Kudos

The intermittent ones are always the hardest. 

Good luck! 

Ben Liebowitz, VCP vExpert 2015, 2016, & 2017 If you found my post helpful, please mark it as helpful or answered to award points.
0 Kudos
frivers
Contributor

I was just browsing my AV vendor's support forums and noticed that some people were having similar performance issues with the current agent.  Once the new agent is released next week, I'll give it a limited rollout and see how it does.

0 Kudos
jdptechnc
Expert

Things like mass A/V updates, Windows Update downloads/installs, etc. are bound to cause performance issues if the typical performance footprint is near a potential bottleneck.  Most recently, in one of my environments, we hit extremely high storage latency from the automatic quick scan that is initiated when definitions are updated.  Every VM was issuing a high number of read ops at the same time, which caused many of the guest OSes to stop responding on the network briefly because they couldn't access their disks quickly enough.  It wasn't a networking issue at all; the storage just couldn't keep up.

Please consider marking as "helpful", if you find this post useful. Thanks!... IT Guy since 12/2000... Virtual since 10/2006... VCAP-DCA #2222
0 Kudos
frivers
Contributor

jdptechnc wrote:

Things like mass A/V updates, Windows Update downloads/installs, etc. are bound to cause performance issues if the typical performance footprint is near a potential bottleneck.  Most recently, in one of my environments, we hit extremely high storage latency from the automatic quick scan that is initiated when definitions are updated.  Every VM was issuing a high number of read ops at the same time, which caused many of the guest OSes to stop responding on the network briefly because they couldn't access their disks quickly enough.  It wasn't a networking issue at all; the storage just couldn't keep up.

I had this issue in a previous environment, but my disk activity during peak times is never higher than 50 Mbps according to vCenter.  Maybe that's the issue, but I do have teamed 1 Gb NICs for storage on both hosts, so I can't see how that would be a problem.  Or perhaps there is an issue and I just don't know how to spot storage-related problems.  On top of that, it's intermittent.  For example, it hasn't happened today (yet), even though I'm copying 200 GB of files from one server to the other, and there was nary a hiccup during a mass definition update a few minutes ago.

This made sense when one of my VM network vmnics was running at 100 Mbps, but now that it's back to 1 Gbps the sporadic hiccups are perplexing.  I also have a backup agent on each machine that runs a snapshot at various times throughout the day.  Maybe a snapshot and a definition update at the same time is causing an issue.  I'm still investigating. Again, thanks for the input.  All of this is extremely valuable.

0 Kudos
jdptechnc
Expert

If you can catch it within an hour of the time it happens today, look at the datastore latency on the chart (I believe you can only view this on the Performance chart at the Real-time interval).

Please consider marking as "helpful", if you find this post useful. Thanks!... IT Guy since 12/2000... Virtual since 10/2006... VCAP-DCA #2222
0 Kudos
frivers
Contributor

I'm seeing short spikes in disk latency of 2-4 seconds every few hours.  Is this normal from your perspective?

0 Kudos
jdptechnc
Expert

2-4 seconds, or 2-4 milliseconds?  Latency of 2-4 seconds is very poor and would indicate you have a disk bottleneck.  Do you see any events on your hosts regarding datastore latency and/or lost access to datastores during this time?

Please consider marking as "helpful", if you find this post useful. Thanks!... IT Guy since 12/2000... Virtual since 10/2006... VCAP-DCA #2222
0 Kudos
frivers
Contributor

vCenter is showing between 2,000 and 4,000 ms on one server and between 2,000 and 6,500 ms on the other.  The 100 Mbps vmnic thing must have been a coincidence, though.  The frequency of the latency spikes corresponds with the interval at which 30+ VMs are writing to disk all at the same time.  I'm only using two of my four iSCSI NICs, assuming that I had enough bandwidth, but I didn't think about latency.  Perhaps adding more NICs to the equation will solve my issue?  I'll try it and let you guys know the result.

The spikes are short and so I don't lose access to datastores during this time.
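
For anyone wanting to pick these spikes out of exported chart data, here is a minimal sketch. The sample values and the 30 ms threshold are made up for illustration (only the 2,000 to 6,500 ms range echoes the figures above), and `find_spikes` is a hypothetical helper, not a vSphere API.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[str, float]  # (timestamp label, latency in milliseconds)


def find_spikes(samples: Iterable[Sample], threshold_ms: float = 30.0) -> List[Sample]:
    """Return the samples whose latency exceeds the threshold."""
    return [(ts, ms) for ts, ms in samples if ms > threshold_ms]


# Hypothetical samples, spaced 20 seconds apart.
samples = [
    ("09:00:00", 4.0),
    ("09:00:20", 2150.0),
    ("09:00:40", 6.5),
    ("09:01:00", 3900.0),
    ("09:01:20", 5.2),
]

for ts, ms in find_spikes(samples):
    # Show both units, since the thread mixes up ms and seconds.
    print(f"{ts}: {ms:.0f} ms ({ms / 1000:.1f} s)")
```

Printing both units makes the milliseconds-versus-seconds question obvious at a glance; the threshold should be set to whatever the storage vendor considers acceptable.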

0 Kudos
jdptechnc
Expert

You might want to check the storage system itself, if you have any way to track performance there.  For example, cache hitting a high watermark and constantly having to flush to disk could cause the entire system to suffer, as could the disks themselves being maxed out performance-wise.  It's possible that it just can't accept requests as quickly as your VMs are sending them during these times.

Is this storage system only being used by your VMware environment?

Please consider marking as "helpful", if you find this post useful. Thanks!... IT Guy since 12/2000... Virtual since 10/2006... VCAP-DCA #2222
0 Kudos
frivers
Contributor

I have my storage system tiered between a high-capacity/low-speed tier and a low-capacity/high-speed tier.  Most of my servers (file servers, app servers, etc.) are on the slow storage.  That may be the issue.  I was saving the fast storage for SQL.  Would a Storage vMotion to the fast storage work?

0 Kudos
jdptechnc
Expert

That might fix your latency problem, sure.  If the problems are affecting your applications and/or end users, maybe move the VMs related to those first.  Otherwise, it may not matter (if a tree falls in the forest and no one is there to hear it...).  You said you needed to keep some space on your high performance tier for SQL Server, so that's a consideration too, of course.

Please consider marking as "helpful", if you find this post useful. Thanks!... IT Guy since 12/2000... Virtual since 10/2006... VCAP-DCA #2222
0 Kudos