We have a vSphere host that lost its NIC, and now we cannot connect to it. It shows as disconnected, and all the VMs running on it are disconnected too.
I could have sworn that if this happened, the VMs would vMotion to a good host, but obviously VMware never thought this would happen.
The VMs are designed to vMotion, but they don't, and we can't migrate them. If I try to remove the host, it will take all the VMs with it.
How do I move these VMs to a good host so that I can get my production-critical VMs online again?
I think you have DRS and HA confused. DRS will vMotion guests to balance the resource load across hosts, and HA will restart VMs on another host should the host fail or become isolated. FT keeps a "ghost" copy of selected guests running on a second host in case the "running" guest or its host crashes.
Since neither of those things has happened, the critical part will be restoring connectivity on the service console uplink (I'm assuming that is the one you have lost) so you can get the host connected back into vCenter. Are you sure that you have lost connectivity to the service console? Can you ping the host, SSH/PuTTY into it, or connect with the vSphere Client directly to the host's IP address?
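If you want to script that reachability check rather than try each tool by hand, here is a minimal sketch in Python. It only tests whether a TCP connection can be opened; it does not log in or talk to any VMware API. The port list is my assumption (22 for SSH, 443 for the vSphere Client connection, 902 for host management traffic), so adjust it to whatever your hosts actually expose, and the function names are just illustrative.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_host(host, ports=(22, 443, 902)):
    """Map each port to whether it accepted a TCP connection.

    The default ports are an assumption, not something taken from
    this thread; change them to match your environment.
    """
    return {port: port_open(host, port) for port in ports}
```

Calling check_host() with the host's management IP returns a dict of port to True/False. If every port comes back False but the guests still answer on the VM network, that points at the management uplink rather than the whole host.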
If you can't get it back, then you are probably left with remotely shutting down the VMs (assuming you have a separate vSwitch/uplink for the VM network) and removing the host from the vCenter inventory. Then you can re-add the guests to inventory with the datastore browser on one of the "working" hosts and power them up.
If the VMs are still running and you have fixed the NIC connection, then simply reconnect the ESX host.
To prevent this problem, you can use NIC teaming.
To handle network or host faults, you can use VMware HA.
Well, after two hours of troubleshooting, we found out that the NIC and the server were perfectly fine. What ended up happening was that the HBA connecting to the SAN burnt out. This caused the system to orphan about 50 VMs. The only way we could un-orphan them was to remove them from inventory and then re-add them.
Once again, I feel that VMware failed us. We have HA for this reason, and it did not work.
Thank you to those who tried to help. It ended up being something we did not foresee, and obviously VMware has no plan to figure out how to get HA to work in the event that a host loses connection to the datastore.
Good job, VMware. And Happy New Year.
The HA solutions VMware provides are for host failures, not storage failures. In a properly designed system with redundant storage heads and storage connections, you don't have to worry about HA kicking in for storage, because the server would never lose connectivity to the storage unless the server itself failed (which would trigger HA) or the storage failed (which is beyond the capabilities of any attached device to recover from).
Seems like you didn't design redundancy for one of the most critical components of your solution, so you are at fault.
So, Good job Wililupy and Happy New Year...
That was the problem. When VMware designed and installed our production environment, they did provide redundant connections to the SAN; however, the monitoring was never enabled or was never set up properly.
On Monday, I have to go to the remote site and verify everything is connected and working properly, and then I will enable HBA monitoring.
I understand that HA is for hosts only. However, because the host failed and not the SAN, shouldn't the other hosts have picked up the running VMs when the host lost its connection to the SAN? The other 5 ESX hosts were still running and connected to the SAN. After I removed the orphaned VMs from the ESX host that could no longer communicate with the SAN, I browsed the datastore on one of the working ESX hosts and added the VMs back to the inventory, and they worked fine once I acknowledged during power-up that I had moved them and not copied them.
I was just extremely frustrated at the situation, since when I ran this scenario in my test environment this morning, I never had this problem. I built a VM on a two-node ESX cluster, and after I pulled the fiber from one host to the SAN, the running VM failed over to the other host and only had about 30 seconds of downtime.
The main reason I blamed VMware for this was because they installed our system and set it all up, and we have it in writing that if our datastore became unavailable because a fiber cable was disconnected or we lost an HBA, we would be fine. In this situation, the server was still available, but it lost connection to the SAN because the HBA card was burnt out. In this case, it was the host that failed, and not the SAN, and because of that, the other hosts should have picked up the orphaned VMs but did not, and we had to go through this huge ordeal of doing it manually.
Lots of fun on New Year's Eve. Hopefully this is not a sign of what the new decade is going to have in store.
HA monitors host failure/isolation through the service console network by default. If the system never lost network connectivity, then HA would have no reason to trigger an event unless you had something else set up to monitor the SAN. I understand you to be saying that something was in place but not enabled?
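To make that distinction concrete, here is a toy model in Python of the decision described above. It is a deliberately simplified sketch, not how HA is actually implemented: real isolation responses are configurable, and the storage_monitoring_enabled flag is a hypothetical stand-in for whatever separate SAN monitoring you had planned. The class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    powered_on: bool        # the host itself is up
    mgmt_network_up: bool   # service console heartbeats are getting through
    storage_path_up: bool   # the HBA path to the SAN is working


def ha_restarts_vms(host, storage_monitoring_enabled=False):
    """Toy model: would the host's VMs be restarted elsewhere?

    Simplified assumption: default HA reacts only to host failure or
    network isolation, and storage loss counts only if some separate
    (hypothetical) monitoring is enabled.
    """
    if not host.powered_on:
        return True   # host failure
    if not host.mgmt_network_up:
        return True   # host isolation
    if not host.storage_path_up and storage_monitoring_enabled:
        return True   # extra monitoring reacts to the lost SAN path
    return False      # host looks healthy to HA, so nothing happens


# The situation in this thread: host up, network up, HBA dead.
dead_hba = HostState(powered_on=True, mgmt_network_up=True, storage_path_up=False)
print(ha_restarts_vms(dead_hba))                                   # False
print(ha_restarts_vms(dead_hba, storage_monitoring_enabled=True))  # True
```

With the host powered on and its network heartbeats healthy, the model returns False for a dead HBA unless the extra monitoring is switched on, which matches what you saw in production versus your test cluster.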
That makes more sense.
In my test environment, we have monitoring set up so that if a host loses connectivity, it initiates a failover. However, we found out through this that ESX does not act on it unless you have the monitoring client installed on the host. After we upgraded from 3.0.2 to 4.0, the client was not upgraded. Since then, we have gone back and installed the client on the hosts so that they are being monitored through the Linux OS of ESX.
This is why it worked in my test environment: we had it all set up.
The reason for this failure was negligence on our part, in thinking that when we upgraded the hosts, all software installed on the server would be migrated with it. I guess when you upgrade, it creates an image of the old install and writes a new installation to the root. This was an assumption that came back to bite us.
Thank you for all your help and for explaining the high availability requirements.