VMware Cloud Community
Noyzyboy
Contributor

Performance issues after upgrade (P4500 ESXi 5)

Hi all,

I was wondering if some of the large brains out there could help me out with a performance issue we are having. Here's a brief history.

For the last three years we've been running a couple of DL380 G6 servers connected to a pair of P4300 nodes via two stacked Cisco 2960S switches, with ESXi 4 installed on a mirrored pair of local disks.

Recently we bit the bullet and, to increase both disk and RAM capacity, purchased two P4500 nodes and two new DL380 G7 servers (2 procs, 96GB RAM in each). We arranged an outage, used the cluster swap feature to move all the LUNs from the P4300s to the P4500s, and then relocated the VMs to the new servers, which run ESXi 5 installed on an SD card. All VMs had their VMware Tools upgraded at the time.

Since the day we moved everything over, we have had intermittent performance issues that manifest as a short (roughly 10-second) disconnect for anyone using a Terminal Server session on those servers. They happen two to three times a day with no discernible pattern.

Other than the changes above, nothing else in the environment has changed, and we are struggling to understand what is causing the problem since the new equipment should be much more capable than the old.

I have had to reconfigure the iSCSI setup due to the changes in the VMware version. For iSCSI, each server has two dedicated NICs on a single vSwitch, with two iSCSI port groups; each port group has one active dedicated NIC and the other NIC set to "Unused". I've also been through and configured multipathing for Round Robin. I'm not using jumbo frames and didn't in the old setup either.
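In case it helps someone spot a mistake, the ESXi 5 shell equivalent of that setup looks roughly like the commands below. This is only a sketch: the port group names (iSCSI-A and iSCSI-B), vmnic2/vmnic3, vmk1/vmk2, vmhba33 and naa.xxx are placeholders rather than our actual values.

# One active uplink per iSCSI port group (the other NIC is left unused)
esxcli network vswitch standard portgroup policy failover set --portgroup-name iSCSI-A --active-uplinks vmnic2
esxcli network vswitch standard portgroup policy failover set --portgroup-name iSCSI-B --active-uplinks vmnic3

# Bind both vmkernel ports to the software iSCSI adapter
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk1
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk2

# Set Round Robin on a device and check the policy took
esxcli storage nmp device set --device naa.xxx --psp VMW_PSP_RR
esxcli storage nmp device list --device naa.xxx

I've also seen the Round Robin IOPS limit (esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxx --type iops --iops 1) suggested for the P4000 family, though I can't vouch for whether it makes a difference here.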

The Cisco 2960S switches are used for both VM and iSCSI traffic. I realise this is possibly not ideal, but it worked fine before, and if I look at the switch stats they are barely being used. All traffic is VLANed.
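One thing I know the port utilisation stats won't show is flow control, since pause frames throttle a port without counting as traffic. If anyone thinks that's a suspect on shared switches like ours, the negotiated pause settings are visible per uplink from the ESXi shell (vmnic2 below is just an example name):

# Shows link state, driver/firmware, and the negotiated Pause RX / Pause TX
esxcli network nic get -n vmnic2

The pause counters on the 2960S ports would be the other half of that picture.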

I have split one of the larger LUNs into three and reformatted them with VMFS5 to see if that would resolve the problem, but it appears not to have.

I installed Veeam ONE monitoring and often see error messages like the one below in the Events log. They sometimes match up with the disconnect issues and sometimes don't, but I get the feeling they are related.

"Device naa.6000eb3a617061706250000000000013b performance has deteriorated. I/O latency increased from average value of 4647 microseconds to 246429 microseconds"

VMware tech support told me any latency above 10 milliseconds is not good, and 246429 microseconds is roughly 246 milliseconds, so I'm thinking this is REALLY not good.
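For anyone wanting to catch one of these spikes in the act, esxtop's disk-device view is the usual tool: DAVG/cmd is roughly the latency from the fabric and array, KAVG/cmd is time spent in the VMkernel, and GAVG/cmd is what the guest actually sees. It can also run in batch mode so the spikes can be lined up against the disconnect times:

# Interactive: press 'u' to switch to the disk device view
esxtop

# Batch mode: one sample every 10 seconds for an hour, dumped to CSV
esxtop -b -d 10 -n 360 > /tmp/esxtop-capture.csv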

Does anyone have any thoughts on why we might be having these problems, or any ideas on how to start troubleshooting? I have a call logged with VMware but we don't seem to be getting anywhere, and they are starting to point the finger at the P4500, although the hardware supplier says that's not possible.

If you need any more information, please let me know.

Thanks in advance,

Rich

Phoenycks
Enthusiast

Maybe I can bump this a bit.

I have almost exactly the same issue, except I'm still on the P4300s. We're using View and have the same problem: random disconnects and I/O latency issues.

Recently we bumped up from two P4300 nodes to four and restriped, thinking it was a storage bottleneck. This has not resolved the problem.

The disconnects only started occurring after upgrading from ESXi 4.1 to ESXi 5.

There must be a compatibility issue somewhere, or a configuration issue, or some default setting we haven't found, that's causing the disconnects.

It's completely perplexing. We also had a ticket open with VMware, but they were unable to find a problem, with the exception of the latency errors.

However, even though it feels like it could be related, I'm not convinced that the latency issues are causing the disconnects. Whenever a user gets bumped off, they can immediately reconnect and pick right back up where they left off.

This feels like a networking issue to me; I think the latency might be another symptom of the problem rather than its cause. We are also using the DL380 G7's, and an older ML370. The disconnects happen on all three hosts.

Either this is an HP problem or a VMware networking problem, but something's wacky, and so far I've only been able to find effects, not causes.

We use the DL380's, an ML370, the P4300's, and Procurve 2910al switches. The DL380's run ESXi from an SD card, the ML370 runs from a local RAID 1. Again, the disconnects occur on all 3 hosts.

I've tried moving a couple of guests to local storage to eliminate the SAN, but they too get disconnected. I really think this is network related.

Anyone else? Any insight?

Jes

Noyzyboy
Contributor

Hi Jes,

Thanks for your input. We seem to have made some progress on resolving this issue, but I'm not 100% sure yet as it's so intermittent.

Originally, all traffic (storage and VM) was going over a pair of stacked Cisco 2960S switches. I've since broken the VM traffic out onto an HP ProCurve switch, and the issue appears to have almost completely vanished (it's difficult to know exactly, as people have got sick of telling us, so we don't always hear...).

I was wondering myself if the issue might be related to ESXi 5, as this was the one change that was a bit of an unknown. It's going to be tricky to go back, though, since we have upgraded all our VMFS volumes to VMFS5 now and don't really have the space to recreate all the volumes and move the VMs over again using Storage vMotion.
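For anyone checking the same thing, the filesystem version of each datastore is visible from the shell; the datastore name below is only an example:

# Prints the VMFS version (VMFS-3.x vs VMFS-5.x), block size and extents
vmkfstools -Ph /vmfs/volumes/datastore1

# Or list every mounted filesystem with its type
esxcli storage filesystem list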

Anyone from VMware monitoring this and aware of an issue?

Cheers,

Richard

Phoenycks
Enthusiast

We have all traffic flowing over stacked HP ProCurve 2910als, which are uplinked to Cisco switches for distribution to the desktops.

The iSCSI traffic is VLANed off and isolated, but unfortunately I don't have another switch I could use to isolate everything physically.

Downgrading is not an option - we have everything maintained at the latest revisions, updated monthly, and we're using View 5.1.1 so there's really no going back.

I'm a little surprised that you've done so well minimizing the issue by moving the VM traffic, but at its core I still think this is a networking problem.

A note (especially if VMware sees this): we did update the NIC drivers with assistance from VMware, and we've reconfigured the switches and vSwitches several times, trying different load-balancing configurations. None of these changes has made any difference.
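For completeness, this is the sort of check we've used to confirm what driver and firmware each host is actually running (names and output vary per host, obviously):

# Every uplink with its driver and link state
esxcli network nic list

# Installed NIC driver packages and their versions (most are named net-something)
esxcli software vib list | grep net-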

If I eliminate the NIC drivers (both Intel and Broadcom, both on the latest drivers and firmware), eliminate the switches (this seems to occur on both HP and Cisco), and eliminate the vSwitches (we've tried multiple configurations with no change in results), what's left? The VMkernel? That narrows it down to a very few things that could still be the cause...
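One way to narrow it down further: the latency alarms are written to the VMkernel log, so their timestamps can be matched against the exact times users get bumped off:

# All recorded latency-degradation events, with timestamps
grep -i "performance has deteriorated" /var/log/vmkernel.log

# Or watch for new ones live
tail -f /var/log/vmkernel.log | grep -i deteriorated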

Consider what we see: random disconnects and random I/O latency alarms. The SAN (HP P4300s) is iSCSI. Any and all traffic across the NICs experiences random bottlenecking or throttling, causing the disconnects in the View environment and the high I/O latency on the iSCSI side.
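One array-specific setting that keeps coming up for the P4000/LeftHand family is the initiator's Delayed ACK; VMware has a KB article about disabling it for certain iSCSI arrays. I'm not claiming it's the fix, but the current value can at least be read from the shell (vmhba33 standing in for the software iSCSI adapter name):

# Lists the software iSCSI initiator parameters, DelayedAck among them
esxcli iscsi adapter param get --adapter vmhba33

Actually changing it normally goes through the iSCSI adapter's advanced settings in the vSphere Client, per that KB.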

What's the common factor? That's where the problem will be.

Jes

Noyzyboy
Contributor

I have to admit I'm getting more and more convinced this is a VMware ESXi 5 issue, because during our upgrade all the hardware was replaced with kit of a much higher spec.

It's been incredibly frustrating. I've had calls open with both VMware and HP to try to resolve this, but each likes to point the finger at the other.

I was also surprised that shifting the traffic to another switch had such an impact. To be honest, it was a last-ditch attempt because I was running out of things to try.

It must be said, it makes me feel slightly better that someone else is having the same issue.

Thanks,

Richard

Phoenycks
Enthusiast

Any luck? I think I'm going to take this thread and open another case with VMware. There has to be a fix or configuration change that will resolve this issue.

I'd like to think that considering the issue only appeared with ESXi 5, we should be able to point the finger pretty solidly at VMware...

Have you heard of any other people having this problem or something similar? I'm going to search the forums when I can...

Jes

Noyzyboy
Contributor

No luck from this end. I've been battling with this for about three months, and once I put the other switches in and things calmed down, I kind of left it on the back burner.

I'd love to know if you get it sorted out, and I'm happy to help if VMware needs another example of the issue, so please let me know.

Cheers,

Richard

WCTech
Contributor

Sorry, I'm only dragging this up in the hope that the original authors are subscribed (and therefore get notified) and respond. We're having similar issues, and I was hoping you could give some advice on how you solved these problems?

Thanks!
