VMware Cloud Community
UoGsys
Contributor

Cluster Networking Issues

Hi All,

Just wanted to share a 'fun' situation which has been ongoing for the last few months. This case has bounced back and forth between VMware and HP support, and I think we've finally reached the end of our tether! Below is a summary of the problem -

"We recently experienced a loss of network communications from a shared uplink set presented to our c7000
based  ESXi cluster. This uplink set hosts all of our virtual machine traffic  through multiple VLANs, therefore we lost all communications to our  virtual machines. See below for a list of hardware/software -

c7000 blade chassis
OA Firmware - 3.32
10x BL460c G7 hosts
2x VC 8Gb 24-Port FC Module
Firmware - 3.30
2x VC FlexFabric 10Gb/24-Port Module
Firmware - 1.04
Emulex NC553i 10Gb 2-port FlexFabric Converged Adapter
Firmware - 4.0.360.15
be2net driver - 4.0.355.1

In terms of connectivity we use the following -
Single Shared 10GbE uplink for all our virtual machine traffic (per module)
Single Dedicated 1GbE Uplink for ESXi management traffic (per module)
Both delivered via the VC FlexFabric 10Gb/24-Port Modules
Dual 8Gb FC uplinks for storage traffic delivered to blades via Mezzanine cards (per module)

In this incident all management ports to the ESXi hosts are available, but VMs are not contactable from machines outside the cluster.
VMs are accessible from the vSphere console and still running, but they cannot talk to networked services on other subnets.
VMs can talk to other VMs in the same cluster, on the same host, on the same subnet; that traffic never leaves the Distributed Virtual Switch (DvS), by design.
In the end, we could only restore normal operation by powering down the entire chassis."
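
For anyone facing the same thing, a quick way to pin down the failure pattern during an incident (local DvS traffic vs. anything that has to leave the chassis) is a small ping matrix run from a VM inside the cluster. This is only a rough sketch; the addresses are hypothetical placeholders, not from our environment:

# Rough sketch: classify the outage by pinging a few well-chosen targets from a
# VM on the affected cluster. All addresses below are hypothetical placeholders.
import subprocess

TARGETS = {
    "vm, same host, same subnet": "10.0.10.11",
    "vm, other host, same subnet": "10.0.10.12",
    "default gateway": "10.0.10.1",
    "service on another subnet": "10.0.20.5",
}

def reachable(ip, count=2, timeout=2):
    # Returns True if the address answers ping (Linux ping syntax assumed).
    result = subprocess.run(["ping", "-c", str(count), "-W", str(timeout), ip],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

for name, ip in TARGETS.items():
    print("%-30s %-15s %s" % (name, ip, "OK" if reachable(ip) else "UNREACHABLE"))

During our outage the pattern was exactly what the summary above describes: traffic that stayed on the same host and subnet was fine, while anything that had to leave the chassis via the shared uplink set was dead.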

HP's advice was the stock-standard "Update your OA/VC/blade enclosure firmware to the latest release" (we also updated the ESXi drivers from HP's most recent September release of ESXi 5.0 U1).

We did this as a pilot run at our secondary site, which runs the same hardware/software and, more importantly, had not experienced the above issue. Two days later we had exactly the same issue at the secondary site. HP faffed about, batted it to VMware, who batted the case straight back, and then we ran out of troubleshooting time and had to restore services...

In my mind this looks like a Virtual Connect issue, but at this point I am not ruling anything out.

IMHO our setup is a pretty common one, so I am curious as to whether anyone has seen or experienced anything similar or can give us some pointers as to where to go next.

At the moment I feel like we're sitting on a ticking time bomb. We have no confidence in either setup, as we have not traced the cause of the fault and therefore have no idea when the next outage will occur. It's worth noting that the primary site was in working operation for six months before this ugly issue reared its head. No major updates had been applied to ESXi (apart from minor patches), and no driver or firmware updates had been applied to the enclosure.

At this point we're desperate for any help or advice !

Thanks!

10 Replies
Josh26
Virtuoso

The obvious thing in my mind is...

Consider a single vSwitch (out of however many you have configured). It should think it has two pNICs associated with it, and each of them should go to a different VC. Is this accurate?

What load balancing policy is set up in VMware?

Does the uplinking switch report any errors on the uplink port?

UoGsys
Contributor

Josh26 wrote:

The obvious thing in my mind is...

Consider a single vSwitch (out of however many you have configured). It should think it has two pNICs associated with it, and each of them should go to a different VC. Is this accurate?

What load balancing policy is set up in VMware?

Does the uplinking switch report any errors on the uplink port?

Correct on the first point. To explain our setup further: we have one vSwitch with two physical uplinks associated, each going to a different VC module, and this is used for management traffic. It in turn uses a dedicated 1GbE uplink from each VC module.

We then have a DvS set up for VM traffic, which uses two shared uplinks configured as trunk ports over the dedicated 10GbE pipes, plus vMotion, which uses two adapters that exist purely inside VC.

For load balancing we use "Route based on originating virtual port ID" in an active/active setup, as per HP's VC cookbook.
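
For reference, here's roughly how I've been double-checking that setting across the DvS port groups; this is just a pyVmomi sketch with placeholder vCenter credentials, where 'loadbalance_srcid' is the API value behind "Route based on originating virtual port ID":

# Sketch: print the uplink teaming policy for every DvS port group via pyVmomi.
# The vCenter hostname and credentials are placeholders; SSL handling may need
# adjusting depending on the pyVmomi version in use.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.local", user="administrator", pwd="password")
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
    for pg in view.view:
        teaming = getattr(pg.config.defaultPortConfig, "uplinkTeamingPolicy", None)
        if teaming is None:
            continue  # non-VMware DVS port settings won't carry this attribute
        print(pg.name,
              "policy=" + teaming.policy.value,  # e.g. loadbalance_srcid
              "beacon=" + str(teaming.failureCriteria.checkBeacon.value))
finally:
    Disconnect(si)

It also prints the beacon-probing flag; with only two uplinks per switch, link status is the usual choice anyway.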

No errors from the uplink switch. We did see some errors from VC, but support informed us that these are most likely "errors normally caused by incompatibilities between drivers and FW."

/ports/port1
IfInDiscards = 15
IfInErrors = 39
StatsCRCAlignErrors = 39
Dot3StatsFCSErrors = 39
Dot3StatsSymbolErrors = 1
Dot3InPauseFrames = 2

/ports/port5
IfInDiscards = 15
IfInErrors = 1
StatsCRCAlignErrors = 1
Dot3StatsFCSErrors = 1
Dot3InPauseFrames = 2

/ports/port9
IfInDiscards = 15
IfInErrors = 178
StatsCRCAlignErrors = 178
Dot3StatsFCSErrors = 178
Dot3StatsSymbolErrors = 6
Dot3InPauseFrames = 2

I'm not a networking bod, so I don't know how true this is or whether errors in these small quantities are common.
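
If anyone wants to judge that, one thing that might help is diffing two captures of those counters taken some time apart, to see whether they are actually still incrementing or are just historical noise. A minimal sketch that parses text in the format above (saved to files such as before.txt and after.txt, which are just assumed names):

# Sketch: diff two saved captures of the "key = value" statistics text above to
# show which counters have moved between the two snapshots.
import sys

def parse_stats(path):
    # Parses lines like "IfInErrors = 39" grouped under "/ports/portN" headers.
    stats, port = {}, None
    for raw in open(path):
        line = raw.strip()
        if not line:
            continue
        if line.lstrip("/").startswith("ports/"):
            port = line.lstrip("/")
        elif "=" in line and port:
            key, value = [part.strip() for part in line.split("=", 1)]
            stats[(port, key)] = int(value)
    return stats

before, after = parse_stats(sys.argv[1]), parse_stats(sys.argv[2])
for port, key in sorted(after):
    delta = after[(port, key)] - before.get((port, key), 0)
    if delta:
        print("%-15s %-25s +%d" % (port, key, delta))

CRC/FCS errors that keep climbing during normal operation would point more towards cabling, SFPs or the physical links than towards a firmware bug, whereas static counts could just be leftovers from a past event.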

Thanks again!

UoGsys
Contributor

Hi All,

Any more thoughts on this one?

I've been through my vDS and VC config several times now in the hope I'd find something untoward, but nothing.

Has anyone heard of or experienced anything similar to this or have any tips as to what we could do to find a smoking gun?

HP seem to be releasing firmware updates for "HP Emulex 10GbE Converged Network Adapters and 10GbE Network Adapters including server LOMs" at quite a rate, so I wouldn't be at all surprised if this is a LOM driver or firmware issue. VC firmware also seems to be released about as frequently as new Apple kit, so again this could be the culprit.

Any help or views massively appreciated...

Thanks.

nightrider7731
Contributor

Did you ever get a resolution to this issue? I've just got done dealing with a very similar meltdown. Luckily it was non-prod, but it took out two stacked enclosures of database VMs. It happened three times in the course of a week and was only solved (temporarily) by rebuilding the domain. This has not affected any other enclosures (15 in total).

While investigating the issue, with HP's "help", I noticed several things.

  • On the second occurrence, I noticed the stacking links were down within and to the subordinate enclosure. On the third occurrence, the stacking links AND all internal links (bay 1 x7 to bay 2 x7, x8 to x8) were down.
  • During the second Virtual Connect (VC) rebuild, all networking went down while applying any profile to one of the bays. Removing the profile and resetting the VC resolved the issue in about 10 minutes. This was repeatable at will. To test whether this was an issue with the blade or the bay, I swapped blades between that bay and a different bay and tested again. This time I had no problems with either.
  • After one rebuild, I recovered the primary enclosure and then moved a blade (more memory) from the secondary enclosure into a free slot in the primary enclosure (different blades from the previous failure). When I powered the blade on, the networking (all links) went down. I powered down and pulled the blade, reset VC, and was back up in about 10 minutes.
  • During the final recovery attempt, I separated the enclosures into separate domains. However, I still had no internal links.

Since we had issues whether blades were turned on or off, we ruled out VMware and turned our attention to the enclosures' firmware. We had been running v3.70 on both the OAs and VCs for several months. When I down-rev'd the firmware to v3.60, our internal links came back up and we rebuilt the enclosures as two separate domains. We're at 27 hours and counting since the rebuild and are crossing our fingers.

Here is the configuration. Both enclosures are identical.

All blades: vSphere ESXi v5.0 U2. Patches current through the December 2012 release.

Network: one dVS with A- and B-side pNICs. Different port groups on different VLANs for mgmt and data.

c7000 blade chassis
32x BL490c G7 hosts (boot from 4GB SD)
OA Firmware - 3.70

2x VC Flex-10 10Gb/24-Port Module per enclosure
VC Flex-10 Firmware - 3.70 (before down rev to v3.60)
4x VC 8Gb 24-Port FC Module per enclosure
VC SAN Firmware - 1.04 v6.1.0_55
Emulex NC553i 10Gb 2-port FlexFabric Converged Adapter
Firmware - 4.1.402.20
be2net driver - 4.1.334.0

Network access is to a pair of Nexus 5Ks with two EtherChannelled 10Gb links to each side, presented as two pNICs per blade. Both enclosures share this connection.

Thanks in advance!

Gkeerthy
Expert

I also faced the same issue in production: all the networks for the VMs were lost, I saw STACKING LINK FAILURE in the VC, and there was no disconnection on the storage side.

We opened calls with VMware and HP. From the HP VC logs they confirmed the NO_COMM error, the same symptom, and the bug is reported on the HP site. They say it is fixed in VC firmware 3.17, but we have 3.60 and we saw the same issue.

If anybody has solved this issue, please share it in this post. If I get a response from HP I will share it here.

Below are the details of the issue and my update on the HP analysis.

Hardware Analysis (from HP):

2.       So, we saw events from 14:52 and VC + OA communication was up. However, we have observed NO_COMM events being reported for VC BAY1.

3.       We suspect this to be one of the possible reasons: a stacking error was observed and led to the network outage.

As per the HP report below, we also faced the same issue: a total network outage and a Stacking Link Failure status in the VC. Per the statement below, HP says this is resolved from VC version 3.17; the current version on the QNB HP VC is 3.60, but unfortunately even the 3.60 version didn't solve this bug.

SUPPORT COMMUNICATION - CUSTOMER ADVISORY

Document ID: c02720395

Version: 5

Advisory: (Revision) HP Virtual Connect - Virtual Connect Manager May Be Unable to Communicate (NO_COMM) if DNS Is Enabled for Virtual Connect Ethernet Modules

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02720395&lang=en&cc=us&taskI...

The HP Virtual Connect Manager (VCM) may not be able to communicate (NO_COMM) with Virtual Connect (VC) Ethernet modules in an HP BladeSystem c-Class enclosure or multiple enclosures that are part of the same Virtual Connect Domain.

Customers particularly susceptible to this issue have VC Modules with management IP Addresses configured in the 10.x.x.x range and configured for DNS. When this problem occurs, the VC Manager will still be accessible, but all VC Ethernet modules in the domain will be displayed with an Overall Status of "No Communication." The Virtual Connect Domain will show a "failed" status, stacking links will show "failed" and Profiles and Networks will show a status of "Unknown." In addition, the following error messages may be displayed when clicking on Domain Status from the Virtual Connect Manager Web Interface or when issuing the VC CLI command "show status":

While in the NO_COMM state due to the DNS issue, the customer will not experience a VC network outage and they will still be able to pass traffic. However, if DNS environment changes cause the system to regain communication, the VC network may experience a temporary VC network outage of a few minutes. Subsequently, if the system loses communication, the customer may experience a persistent VC network outage until communication returns.

This issue is resolved with Virtual Connect Firmware version 3.17 (or later). VC 3.17 is available as follows:

The HP Virtual Connect 3.70 Release Notes also say the NO_COMM issue is fixed.

Please don't forget to award point for 'Correct' or 'Helpful', if you found the comment useful. (vExpert, VCP-Cloud. VCAP5-DCD, VCP4, VCP5, MCSE, MCITP)
nightrider7731
Contributor

Just found a document describing a new VC firmware, v3.72. Its primary feature is Pause Flood Port protection. We had some issues with high pause frames and suspect this was a possible cause. This was fixed in v3.17, but obviously dropped off other releases. Unable to track down the software yet, so maybe it's close to being released.

ftp://www.compaq.com/pub/softlib2/software1/pubsw-linux/p1881794128/v83835/VC_Virtual_Connect_3.72_R...

gman18480
Enthusiast

What is the failover detection setting on your vSwitch? Is it beacon probing or link status?
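
If it's quicker than clicking through each host, something like this pyVmomi sketch (placeholder vCenter credentials) will report that setting for every standard vSwitch:

# Sketch: report the NIC teaming failover-detection setting (beacon probing vs.
# link status) for each standard vSwitch on every host. Credentials are placeholders.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.local", user="administrator", pwd="password")
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True).view
    for host in hosts:
        for vsw in host.config.network.vswitch:
            teaming = vsw.spec.policy.nicTeaming
            if teaming is None or teaming.failureCriteria is None:
                continue
            detection = "beacon probing" if teaming.failureCriteria.checkBeacon else "link status"
            print(host.name, vsw.name, "policy=" + teaming.policy, "detection=" + detection)
finally:
    Disconnect(si)

For a dvSwitch the equivalent lives under each port group's uplinkTeamingPolicy.failureCriteria.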

Garret DeWulf Professional Services / VMware Consultant / VCP 4&5 / www.veristor.com
nightrider7731
Contributor

Link status.  Only two uplinks.

gman18480
Enthusiast

This is a long shot, but are you getting any errors like these in your host logs about NetQueue? We had a similar issue with a Dell M1000e chassis.

2012-07-17T20:45:09.053Z cpu14:2091)<6>tg3 : vmnic3: RX NetQ allocated on 1
2012-07-17T20:45:09.053Z cpu14:2091)<6>tg3 : vmnic3: NetQ set RX Filter: 1 [00:50:56:7f:96:94 0]
2012-07-17T20:45:44.054Z cpu7:2091)<6>tg3 : vmnic3: NetQ remove RX filter: 1
2012-07-17T20:45:44.054Z cpu7:2091)<6>tg3 : vmnic3: Free NetQ RX Queue: 1
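
If you want to check quickly, a rough sketch like this will sweep log copies pulled from the hosts (or a support bundle) for NetQ entries; the paths and keyword are assumptions based on the usual ESXi 5.x layout:

# Sketch: scan copies of vmkernel/hostd logs for NetQueue (NetQ) lines around
# the failure window. The directory layout and keyword are assumptions.
import glob
import re

pattern = re.compile(r"NetQ", re.IGNORECASE)
paths = glob.glob("support-bundle/var/log/vmkernel*") + \
        glob.glob("support-bundle/var/log/hostd*")

for path in paths:
    with open(path, errors="replace") as log:
        for line in log:
            if pattern.search(line):
                print(path + ": " + line.rstrip())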

Garret DeWulf Professional Services / VMware Consultant / VCP 4&5 / www.veristor.com
UoGsys
Contributor

Hi all - great to see some activity on this post as the issue from our point of view is still very much unresolved.

Without tempting fate, we've not had a recurrence since the last outage, which must be around six months ago now.

We have also just been assigned a new account manager by HP and have asked for the case we logged to be reviewed again. As things stand, HP has exhausted troubleshooting from their end. All we have is a set of formal procedures to capture as much information as possible if this recurs (logs, Wireshark captures, etc.).

Reading through your individual situations, ours sounds similar in terms of the net result, but differs in that I do not remember seeing any errors recorded in VC's logs, or any issues with its domain state. I don't ever remember a NO_COMM error, as I'm sure the domain was still accessible; VC thought everything was operating normally... As things stood we were in such a blind panic to restore services that, apart from log gathering, my memory of the event is hazy! We simply powered the entire chassis down and up again and everything carried on as normal. I then remember having some weird vNIC errors on around 50% of VMs, where they failed to power on. We had to remove the affected vNICs and re-add them, and the VMs then powered up. VMware couldn't explain this vNIC fault, as we logged the case after we had recovered services.
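
For what it's worth, if this ever happens again I'd script the hunt for affected VMs rather than eyeballing the inventory. A pyVmomi sketch along these lines (vCenter name and credentials are placeholders) should list powered-on VMs whose vNICs report as disconnected:

# Sketch: list powered-on VMs whose virtual NICs are not connected, to spot the
# machines needing the remove/re-add treatment. Credentials are placeholders.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.local", user="administrator", pwd="password")
try:
    content = si.RetrieveContent()
    vms = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True).view
    for vm in vms:
        if vm.runtime.powerState != vim.VirtualMachinePowerState.poweredOn:
            continue
        for dev in vm.config.hardware.device:
            if isinstance(dev, vim.vm.device.VirtualEthernetCard) and \
                    not dev.connectable.connected:
                print(vm.name, dev.deviceInfo.label)
finally:
    Disconnect(si)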

In response to queries, we have 'Link status only' set in our vDS. I also can't see any pause frame errors -

Dot3InPauseFrames: 0
Dot3OutPauseFrames: 0

I'm just looking through hostd now, and in the logs from back when we had the failure I can't see anything related to NetQ. I can post the hostd log from the time of the failure if that's of any use?

Thanks in advance!
