VMware Cloud Community
DefenderAtkins
Enthusiast

VMware vSphere 5.5 HA shows no errors or warnings but does not work, and neither does management network redundancy.

Hi all,

I have 3 hosts running ESXi 5.5. I manage these hosts from a vCenter server.

I have created a cluster out of these hosts and enabled HA on the cluster. There are no error messages or warnings. The vSphere HA summary page of the cluster shows:

pastedImage_0.png

In addition, I have management network redundancy on each host via a DVSwitch. There is a port group called mgmt for the management network; this port group is connected to 3 NIC uplinks (Fibre A, Fibre B and Copper) for redundancy. The other port group I have is vMotion. Both the mgmt and vMotion port groups have the correct VLANs:

pastedImage_2.png
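For reference, here is roughly how the same information can be checked from the ESXi shell on each host (a minimal sketch, assuming SSH access is enabled; the names below are from my setup and may differ on yours):

# List physical NICs with their link state and speed
esxcli network nic list

# List VMkernel interfaces (management, vMotion) and the port groups they use
esxcli network ip interface list

# Show the distributed switch and the vmnics it claims as uplinks
esxcli network vswitch dvs vmware list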

Now I have found two problems:

1. When I disable the switch port connected to vmnic5Uplink on the Cisco switch, the host goes down; in other words, there is no management network redundancy, even though I am using all 3 available NICs for management traffic. Nowhere in vSphere does it mention a problem with management redundancy, and there are no warnings about it. Whenever I disable the switch port connected to any one of the 3 uplinks, the host goes entirely down (Not responding).
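As a side note on the testing method: the same link failure can also be simulated from the host side rather than from the physical switch (a sketch, assuming SSH access to the ESXi shell; vmnic5 is just the example uplink here):

# Administratively take an uplink down to simulate link loss
esxcli network nic down -n vmnic5

# Bring it back up after the test
esxcli network nic up -n vmnic5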

2. When the host goes down, HA does not work. HA tries to fail over the VMs residing on the affected host, but the failover fails. Before the host reaches the "Not responding" stage, there is this error:

pastedImage_5.png

pastedImage_3.png

About 10 seconds after this error message, it says:

pastedImage_4.png

I cannot figure out why this is happening. Does anyone have any ideas from looking at the above?

Thanks

8 Replies
hussainbte
Expert

I have a couple of questions:

1) What happens when you disable either one of the 2 fibre ports (A or B)?

2) Could you also share the host isolation response setting on the cluster?

If you found my answers useful please consider marking them as Correct OR Helpful. Regards, Hussain. https://virtualcubes.wordpress.com/
ShekharRana
Enthusiast

1. What do you mean by "host goes down"? Is it no longer on the network?
2. Please share the NIC teaming policy on the switch.
3. Check the network port settings on the physical switch.
4. Are the VMs on shared storage?
5. What is the HA protection status on the VMs?

Finikiez
Champion

Hi!

To properly test HA, you need to power off a host with running VMs using iLO/iDRAC or by pushing the power button.

You can also generate a PSOD manually from the ESXi shell:

vsish -e set /reliability/crashMe/Panic 1

Also, HA works properly only when your VMs reside on shared datastores. VMs on local datastores cannot be restarted on other hosts.

When you shut down the network ports, you will more than likely get a network isolation event (as you posted in the screenshot). There is a separate action for that case: check the "Host isolation response" setting in the HA cluster settings.
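If you want to see what the HA (FDM) agent itself thinks is happening during such a test, follow its log on the host (a sketch, assuming SSH access; this is the default log location on ESXi 5.x):

# Follow the HA agent log while you pull the network
tail -f /var/log/fdm.log

# Afterwards, search it for isolation-related events
grep -i isolat /var/log/fdm.log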

When the management network is down but connectivity to shared storage is still alive (which it is with Fibre Channel storage, for example), datastore heartbeating still works, and your host can tell the master host which VMs are active and running.
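You can also see the heartbeat files themselves: HA keeps them in a hidden .vSphere-HA folder at the root of each heartbeat datastore (a sketch; Datastore01 is a placeholder name, use one of your own datastores):

# List the HA heartbeat files on a datastore
ls -la /vmfs/volumes/Datastore01/.vSphere-HA/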

DefenderAtkins
Enthusiast

Hi ShekharRana,

Thanks for your reply.

1. What do you mean by "host goes down"? Is it no longer on the network?

I can no longer ping the host IP, and the host goes to "not responding" status in vCenter.

2. Please share the NIC teaming policy on the switch.

pastedImage_0.png

pastedImage_2.png

3. Check the network port settings on the physical switch.

The physical switch has the correct settings; I have confirmed them together with my network administrator.

4. Are the VMs on shared storage?

All 3 hosts are using 2 volumes from the SAN, mapped via iSCSI adapters. I'm not sure whether this counts as shared storage or not.
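One way to confirm this from the ESXi shell (a sketch; run it on each host and compare the output): if the same naa.* device identifiers back the same datastores on all 3 hosts, the volumes are shared storage in the sense HA needs.

# Map each VMFS datastore to its backing device ID
esxcli storage vmfs extent list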

5. What is the HA protection status on the VMs?

Cluster HA settings

VM summary (same for all VMs in this cluster):

DefenderAtkins
Enthusiast

Hi Finikiez,

None of the VMs are on local datastores; they all reside on SAN volumes attached to the hosts via iSCSI adapters.

hussainbte
Expert

Can you take the copper link out of the picture and test HA?

With the 2 fibre NICs you still have redundancy.

If you found my answers useful please consider marking them as Correct OR Helpful. Regards, Hussain. https://virtualcubes.wordpress.com/
DefenderAtkins
Enthusiast

Hi Hussainbte,

Thanks for your suggestion. I tried exactly that yesterday, and redundancy worked.

There is a very strange thing happening here.

  1. My network setup: the Fibre A and B NICs from the host are connected to the fabric switch, which has a trunk to the core switch.
  2. The Copper port is connected to a copper switch, which has a separate trunk to the core switch.
  3. I run a continuous ping from a physical PC which has access to the management network.
  4. When I shut the switch port going to the Fibre A NIC on the host, the management network fails over to Fibre B.
  5. I lose 5 pings and it comes back up.
  6. I re-enable the link to Fibre A.
  7. When I shut the switch port going to the Fibre B NIC on the host, the management network fails over to Fibre A.
  8. I lose 5 pings and it comes back up.
  9. When I shut the switch ports going to both Fibre A and B, the management network does not fail over to Copper.
  10. 100% ping loss from both the PC AND the copper switch.
  11. When I re-enable Fibre A or B or both, the PC can ping again AND the copper switch can ping the management IP again.
  12. I don't understand why the state of Fibre A and Fibre B could have such an effect on the Copper port (the vmkernel log sketch below is what I watch while running these steps).
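For reference, this is how I watch the host's own view of the link changes while running the steps above (a sketch, assuming SSH access; the filter is just an example):

# Follow the vmkernel log and watch for uplink link-state messages
tail -f /var/log/vmkernel.log | grep -i vmnic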

Here is how teaming and failover are set up for the mgmt port group on the DVSwitch:

pastedImage_3.png

hussainbte
Expert
(Accepted solution)

The teaming configuration needs modification.

vmnic4 is the only active NIC, which means that as long as the upstream port for vmnic4 is up, the other 2 NICs will not be used for management traffic.

"When I shut the switch ports going to both Fibre A and B, the management network does not fail over to Copper": I am not sure why you are seeing this.

Having said that, I suggest you use the 2 fibre adapters for your management network.

The 2+1 configuration seems a little odd, if only because I have not seen such a setup before. I am also not sure it can be considered a design best practice.

I would rather suggest the config below:

Fibre port A and Fibre port B both active.

Either leave the switch ports as independent ports and use "Route based on physical NIC load", or team them up at the physical switch end and use "Route based on IP hash" (load-based teaming expects independent ports, not a port channel).

You will get redundancy and load balancing either way.
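On a DVSwitch the failover order is edited in vCenter, under the port group's Teaming and failover settings. Purely as a reference, on a standard vSwitch the equivalent change would look roughly like this from the ESXi shell (a sketch; the vSwitch name is an example, and vmnic4/vmnic5 are assumed to be the two fibre NICs):

# Make both fibre uplinks active for the vSwitch
esxcli network vswitch standard policy failover set -v vSwitch0 --active-uplinks vmnic4,vmnic5

# Verify the resulting teaming policy
esxcli network vswitch standard policy failover get -v vSwitch0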

What upstream switch do you have?

If you found my answers useful please consider marking them as Correct OR Helpful. Regards, Hussain. https://virtualcubes.wordpress.com/