Anyone else seeing the issue below?
We are using HP BL685G6 blades but I think this issue also happens with the half heights.
Virtual Connect Flex-10 modules running firmware v2.30
ESX 4.0 or 4.0 U1, either shows the issue.
What we're seeing is that when an uplink from the Flex-10 modules is pulled, the status of the vmnics does not change: we should see the vmnics as 'down', but they continue to show as active.
Although an alert is shown in vCenter saying redundancy is lost, you do not see any red crosses by the cards in the networking configuration screen.
Because of this issue, unless you configure network failover to use beacon probing, you will lose connection if a link goes down.
Also, if you're planning to use the Nexus 1000v (we've tested this), be aware that it has no beacon probing option.
We are currently going through a design build with HP and have come across the same problem.
HP documentation is at fault here, as it does not adequately explain how the FlexNICs work.
The HP Flex-10 Technology brief (HP Flex-10 Technology.pdf), page 14, says:
SmartLink not supported on individual FlexNICs but will continue to be supported from the physical port point of view. In this Flex-10 release, VC is not capable of dropping the link to a single FlexNIC. Therefore, the physical link will drop only when all networks assigned to all FlexNICs on a given physical port have “SmartLink” option checked, and all of their uplinks are broken. Conversely, as soon as the first vNet has a single port restored, the physical port, and therefore all FlexNICs, will have the link restored. This functions in a similar fashion to non-Flex-10 ports with multiple networks assigned. SmartLink continues to operate normally for traditional 1Gb NICs in a VC environment and will be fully functional for Flex-10 operation in a future release of VC firmware.
That indicates you cannot use SmartLink for individual FlexNICs, so if the uplinks do fail, that information is not passed on to the blades, VMware will not fail over, and either the LAN or NAS traffic will be blocked.
However, the HP BladeSystem Reference Architecture: Virtual Connect Flex-10 and VMware vSphere 4.0 document, page 24, says:
Virtual Connect firmware v2.30 introduced Dynamic Control Channel (DCC) support to enable SmartLink with FlexNICs. This allows for individual FlexNIC
state change if the uplink ports for a defined vNet are no longer available to force NIC teaming software failover.
So, which is it?
Our testing shows that SmartLink is not working on a FlexNIC, even though we are on firmware 2.31.
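The pre-DCC behavior described in the HP Flex-10 brief quote can be sketched as a small model: the physical link (and therefore every FlexNIC sharing the port) drops only when every FlexNIC carries a SmartLink-enabled vNet whose uplinks are all down. This is our own illustrative model, not an HP API.

```python
def physical_link_up(flexnics):
    """flexnics: list of dicts with 'smartlink' (bool) and
    'uplinks_up' (list of bools, one per uplink of that FlexNIC's vNet)."""
    for nic in flexnics:
        # Any FlexNIC without SmartLink enabled, or with at least one
        # live uplink, keeps the whole physical port (and so ALL
        # FlexNICs on it) reporting link-up to ESX.
        if not nic["smartlink"] or any(nic["uplinks_up"]):
            return True
    return False

# Example: one FlexNIC's vNet has lost every uplink, but a sibling
# FlexNIC still has a live uplink, so the failed vNet's vmnic still
# shows as "up" in ESX -- exactly the symptom described above.
port = [
    {"smartlink": True, "uplinks_up": [False, False]},  # all uplinks broken
    {"smartlink": True, "uplinks_up": [True]},          # still has an uplink
]
print(physical_link_up(port))  # True
```

This matches what we observe: without per-FlexNIC SmartLink (DCC), ESX never sees the vmnic go down unless the entire physical port drops.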
We have 2 x Virtual Connect Flex-10 modules per chassis, 3 chassis in a rack, so 6 Flex-10s per rack.
All Internal and external stacking links are active.
We tunnel all VLANs through the Flex-10 so we don't have the extra administration of specifying VLANs in the Flex-10 modules; they simply pass all VLANs through.
We have created the following Virtual Connect Ethernet Networks for production use.
We also created 4 more networks with no uplinks, which we call 'unused' NICs. This is so we can apply the server profile with pre-created networks and shouldn't have to power down any blades after deployment to amend the server profiles.
These unused networks may be used in the future to add additional networking, for example for in-guest iSCSI storage (i.e. not the VM disk files, but remote databases etc.).
The server profiles are then applied, which add the 8 networks, mapping each Virtual Connect port to a VMware-assigned NIC.
The 2 ESX networks, esx_switch_a and esx_switch_b, are for all external traffic, which includes LAN traffic for virtual machines and the Service Console, and NAS traffic for NFS storage. All traffic is logically separated by VLANs.
The 2 vMotion networks, esx_int_vmotion_a and esx_int_vmotion_b, have no uplinks assigned and use the internal stacking network per rack, so vMotion traffic stays internal to the rack.
We have also tried to force SmartLink to think that all links are down by allocating uplinks to all the internal networks without connecting them to anything.
When we disconnected the real uplink, the vmnic was still not marked as down.
We can work around it with beacon probing, but I don't know whether that is better than having redundant uplinks at the switch level: basically, two uplinks from the Ethernet network, but from different Flex-10 modules, so that if a switch fails there is another route. One uplink would be Active and the other Standby.
We have a call later today with HP to discuss and I'll update here with any more information.
What has frustrated me is that running ESX on blades through Flex-10 is not an unusual scenario, yet HP's own documentation is confusing and at times wrong. The VC cookbook is out of date.
If anyone has ever read the technical reports from NetApp on how to set up their equipment with various different vendor products, they know how a solution document should be written....come on HP.
Thanks for your very quick response, this is very useful information.
We have a single vSwitch for the service console which uses vmnic0 & vmnic1 (FlexNICs 1a & 2a). These go to separate Virtual Connect modules and separate external Cisco switches for redundancy.
We configure vmnic0 as active with vmnic1 as standby, and ensure the failure detection setting is beacon probing.
Failover works as expected, but the status isn't updated. I'm planning to also speak with HP again today.
I have been informed that an Alpha driver for ESX 4.x from Broadcom with DCC support is currently with HP and should be available by late January 2010.
That's what we have also heard from HP. I've said I will not update any ESX drivers unless they are integrated into a future version of ESX. The last thing I want is custom drivers polluting my ESX installation.
What is still strange, though, is that HP are saying SmartLink should work for individual FlexNICs on 2.30, but without beacon probing it doesn't. Using beacon probing is, in my mind, bypassing SmartLink, so it doesn't look as though SmartLink is actually working without the updated NIC driver. If so, why does the documentation suggest it as a solution?
I agree with you.
We are having a really difficult time of it because we're also testing the Nexus 1000v; the switch works fine with the Flex-10 until you fail an uplink.
This is a massive issue for us, since the Nexus doesn't have a beacon probing option and is dependent on SmartLink functioning correctly.
Like you, I don't really want to load additional drivers unless they are provided by VMware and bundled in a support package.
BTW, VMware are also testing this issue in their engineering labs at the moment.
Here's the solution now agreed with HP.
We have 4 uplinks from the rack: 2 from Chassis A-A and 2 from Chassis C-B. Each uplink from a Flex-10 module goes to a different Cisco Nexus switch, and we use a vPC to create a logical port channel spanning both Nexus switches. All links show as Active within Virtual Connect. If you don't have Cisco Nexus, both uplinks would need to go to the same switch.
The advantage of splitting the uplinks between Nexus switches is there will be no need for host failover if an upstream switch fails and you can also survive a double failure of an upstream switch and a Flex-10 switch at the same time.
We enable Beacon probing for all failure detection.
All LAN traffic is primary over esx_switch_a and all NAS traffic is primary over esx_switch_b using VMware port groups and Active/Standby links.
We have 2 x internal VMotion networks that connect through the Stacking links to create a per rack network.
If any single uplink fails, the vPC will still pass traffic and no host failover is required. If a Nexus switch fails, the vPC will still pass traffic and no host failover is required.
If a port channel were to fail, beacon probing would detect the esx_switch network as having no uplinks and would fail traffic over to its partner.
If a Flex-10 module were to fail, blades in the same chassis would see one NIC as disconnected and fail over. If the failed Flex-10 module had uplinks, blades in other chassis would see those uplinks as no longer passing traffic, and beacon probing would fail over.
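The failure scenarios above boil down to a simple decision: failures absorbed by the vPC upstream need no action on the host, while anything that takes a vNet's uplinks (or a Flex-10 module) away must be caught by beacon probing or link-down. A rough model of that decision, using our own scenario names (not HP or Cisco terminology):

```python
def host_failover_needed(failure):
    """Return True if the host's NIC teaming must fail over."""
    # Absorbed upstream by the vPC port channel: no host failover.
    if failure in ("single_uplink", "nexus_switch"):
        return False
    # Detected on the host by beacon probing (or link-down once
    # SmartLink/DCC works): NIC teaming fails over to the partner.
    if failure in ("all_uplinks_on_vnet", "flex10_module"):
        return True
    raise ValueError(f"unknown scenario: {failure}")

for f in ("single_uplink", "nexus_switch", "all_uplinks_on_vnet", "flex10_module"):
    print(f, "->", "host failover" if host_failover_needed(f) else "handled by vPC")
```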
HP say the updated Broadcom driver to enable SmartLink for a FlexNIC is going to be available in January and is going through VMware certification.
We will only install this driver if it is released as a patch from VMware.
What this will do is speed up failover, as beacon probing waits for 3 beacons to fail before failing over. That loses a few pings, but not many, so it is OK for now. With SmartLink, failover will be immediate.
So, HP need to update their documentation to specify when SmartLink will work as described, and should also explicitly recommend in the document that beacon probing is the required failure detection mechanism.
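The timing difference is easy to see with a toy model of the 3-missed-beacons rule: the teaming policy only declares a vmnic failed after 3 consecutive missed beacons, so failover lags the actual link failure by a few beacon intervals (hence the dropped pings). This is purely illustrative of the behavior described above, not VMware's implementation.

```python
MISSED_BEACONS_TO_FAIL = 3  # beacons that must fail before failover

def failover_time(beacon_results, interval_s=1.0):
    """beacon_results: sequence of bools, True = beacon received.
    Returns the time (s) at which the NIC is declared failed,
    or None if it never accumulates 3 consecutive misses."""
    missed = 0
    for i, received in enumerate(beacon_results):
        missed = 0 if received else missed + 1
        if missed >= MISSED_BEACONS_TO_FAIL:
            return (i + 1) * interval_s
    return None

# Link dies after the 2nd beacon: failover is declared 3 intervals
# later, which is the window in which pings are lost.
print(failover_time([True, True, False, False, False]))  # 5.0
```

With working SmartLink/DCC the link-down is signalled immediately, so this multi-interval detection window disappears.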
To be able to use "Link status only" in VMware 4.x together with HP Virtual Connect Flex-10 modules, you need DCC support at every level: driver, NIC firmware, and Virtual Connect firmware.
To get DCC support, use these minimum versions:
-HP BL495c G5/G6 BIOS A14 2009-12-09 updates NC532i fw to 5.0.17
-NC532i PXE Boot Code fw 5.0.11 or newer, we use 5.2.7
-HP Virtual Connect version 2.30 or newer, we use 2.32
-VMware driver for bnx2x version 1.5.0 or newer, we use 1.52.12.v40.3
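As a quick sanity check against the minimums listed above, version strings like these can be compared on their numeric components (a helper of our own, not an HP tool; it is naive about suffixes such as the "v40" in the driver version, which it treats as just another number):

```python
import re

def version_tuple(v):
    """Extract the numeric components of a version string,
    e.g. '1.52.12.v40.3' -> (1, 52, 12, 40, 3)."""
    return tuple(int(n) for n in re.findall(r"\d+", v))

# Minimums from the list above, and the versions we actually run.
MINIMUMS = {
    "virtual_connect": "2.30",
    "bnx2x_driver": "1.5.0",
    "nc532i_pxe_boot": "5.0.11",
}
installed = {
    "virtual_connect": "2.32",
    "bnx2x_driver": "1.52.12.v40.3",
    "nc532i_pxe_boot": "5.2.7",
}

for name, minimum in MINIMUMS.items():
    ok = version_tuple(installed[name]) >= version_tuple(minimum)
    print(f"{name}: {installed[name]} (min {minimum}) -> {'OK' if ok else 'TOO OLD'}")
```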
We used the "Broadcom online firmware update utility 220.127.116.11" to update the PXE boot code firmware.
If you install HP Management Utilities 8.3.1 for ESX 4.x, you can use the HP System Management Homepage to see all firmware versions; use this syntax: "https://hostname:2381".
Now, if you lose the uplink to Virtual Connect, the affected vmnic gets a red cross as expected.
I am experiencing similar issues with roughly the same setup (we use 460c blades). After updating the firmware and the VMware driver, we can fail over OK using link status only. However, we have a problem failing back (i.e. reinserting the Virtual Connect module, or powering it on via the OA). In this case the VC module brings up the link (but not the protocol), and SmartLink immediately re-enables the link to the ESX server, which fails back (because we set it up to do so).
Because the VC module isn't entirely online yet, we experience timeouts, failed file copy actions, and ping loss. Are you having these problems as well?
With beacon probing, the failback goes well, by the way.
I will try these things out next week when we do more extensive tests of our configuration.
So far we have only tried the link failover test by dropping the uplinks.
What I can't understand is how VMware could certify the HP Virtual Connect Flex-10 modules with VMware 3.5 U4 in early 2009, or maybe late 2008.
We have been running our production setup since March 2009 with components borrowed from HP, because our support cases could not be solved.
Only now is it starting to work as it should, so that we can possibly run our production on this solution.
But we'll see what the next tests show.
Same problem here.
Using "Link Status Only" and reinserting the Virtual Connect module caused all our test VMs to reboot.
Same test using "Beacon Probing" works fine.
Just curious: has anyone used iLO2 1.81 and tried to install VMware 4.0? Good luck.
So the v1.52.12.v40.3 release notes explicitly say do not use with HP Flex-10, as testing is underway and not complete. Months later... still no newer driver. What a bummer!
It seems to be supported now, but I haven't tested it yet.
It's a shame that it's not actually a new version or build.
I'm still having problems with the driver with RTSP traffic between a Microsoft App-V server and client. As soon as my client starts streaming the application from the server, the ESXi server PSODs with errors alluding to the bnx2x driver being at fault. The older 1.48 driver didn't have this problem, but it also doesn't support DCC/SmartLink either and I don't want to use beacon probing as a failover mechanism.
Doh! We are looking at implementing App-V, so I'm alarmed that RTSP traffic causes PSOD when using the latest driver. I also don't want to use beaconing and need to use the SmartLink feature. Did you open a support ticket with VMware about this issue?
@DSeaman: Doh sums it up quite nicely! And now I wouldn't even say it's restricted to RTSP traffic as I had another server PSOD this morning - however this time the App-V server wasn't on it so I have to rule out RTSP being the cause on this one. PSOD still points to the bnx2x driver being the root cause. I've attached a screenshot of the PSOD so maybe someone can confirm my thinking?
Btw, I have to mention that it's only when RTSP traffic (if that is the cause) leaves the Flex-10 modules that the server PSODs. By "leaves" I mean that the App-V server is on a different subnet to the App-V client, and hence traffic needs to leave the Flex-10 module, hit the core switch/router, then route to the appropriate subnet the App-V client resides on. If I have the App-V server and client on the same port group/subnet/chassis then there is no problem, so it all depends on your environment. But at the end of the day, I suspect you'll want this resolved irrespective of how your App-V environment is set up.
As for the support ticket, I am in the process of getting my details onto my client's list of parties authorised to log a fault call on their behalf.
I'll update this thread as I work through the problem.
Has anyone verified smartlink is working? I did not install the driver while it was unsupported so I never tested smartlink.
SmartLink with Link State tracking on the vSwitch works fine.
Same problem here with ESXi 4 U1 hosts running the 1.52.12.v40.3 driver, but with SQL 2008 VMs running on the hosts. All was good for about a week, then we started getting random PSODs. We logged a call with VMware (through HP support), and they have come back to say we should roll back to the previous driver.
So now, having waited ages for the driver to become officially supported, we are back to the hosts having no protection against link/core switch failure. Great.
After VMware reviewed my vm-support logs, they came back to me saying "we have found that this is a known issue and we have reported this to our engineering team. We do not have a fix for this issue as of now but as our engineering team is working on it , we will definitely have a fix for this in the next update."
Major bummer. We've already rolled back to the older 1.48 driver on our production cluster, but we've kept another two ESXi servers aside for testing. I've already upgraded them to 4.0 U2 in the vain hope that they may have fixed the issue.
I'll keep the thread updated with my findings.