18 Replies Latest reply on Feb 10, 2009 12:50 PM by khughes

    Loss of switch causing VI to shutdown?

    khughes Virtuoso

       

      This morning my boss was moving some power cables around to different PDUs, one of them belonging to a Gig switch hosting half of our ESX NICs.  For what we thought were redundancy reasons, we had split the ESX host NICs across two different Gig switches, so that if one went down the other would keep going.  I wasn't in the office for this, but I got an early phone call to come in when the infrastructure went down.  Looking at the vpx logs, once he pulled the power on that switch (which also carried the NIC connection for our physical VC), virtual machines started powering down.  Among our virtual machines were our domain controllers, which also house our DNS, so that might be another issue to consider.

      I have a ticket open with VMware already, but I was curious whether anyone could point out something wrong in our configuration that would explain why all the VMs started to power off when the switch went down.

       

       

       

       

       

      • Kyle

       

       

        • 1. Re: Loss of switch causing VI to shutdown?
          weinstein5 Guru

          See, that is what you get when you let your boss touch stuff.  Were you running HA, and was your isolation response configured to power down VMs?  It sounds like the isolation response kicked in, and when DNS went down it kept HA from bringing them back up.  Just my assessment.

           

           

           

           

          If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

          • 2. Re: Loss of switch causing VI to shutdown?
            kooltechies Expert

             

            Hi,

             

             

            Do you have HA/DRS configured?  I am not sure, but this looks more like a split-brain situation, in which the physical switch failure left all the primary servers in the HA setup unable to reach each other, so they started powering off the VMs for failover to other hosts.  That failover never happened because the other hosts were not reachable, and the VC disconnection may have made it worse.

             

             

            Thanks,

             

             

            Samir

             

             

            • 3. Re: Loss of switch causing VI to shutdown?
              khughes Virtuoso

              Doesn't isolation mean that it can't talk to anything or any other host?  We do have some VMs configured to shut down in that case, but machines that weren't configured that way were shut down as well.  Also, in the ESX hosts files we have manually added the other hosts in case we lost our DNS, so they would be able to talk to all the other servers, including our VC.  I'm not sure if we put that manual entry into our VC, though.

               

               

               

              • Kyle

               

              • 4. Re: Loss of switch causing VI to shutdown?
                kjb007 Guru

                 

                Loss of VC should not cause this type of problem, but loss of DNS may very well have.  Once you configure HA from VC, the servers basically keep track of each other, but if DNS is gone and they can't talk to each other anymore, then, as weinstein5 stated, you may have ended up with all of your hosts thinking they were isolated, which would cause them to perform their isolation response.  Depending on your version, by default this would be to power off the VMs.

                 

                 

                 

                 

                 

                -KjB

                 

                 

                • 5. Re: Loss of switch causing VI to shutdown?
                  kjb007 Guru

                   

                  If you had hosts file entries, then DNS should not have caused this issue.  Make sure your hosts entries are correct.  Also make sure the VLAN you are using for your service console traffic is available on both of your physical switches.
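
One way to check hosts entries is sketched below. It is only an illustration: the host names (`esx01`, `esx02`, `vc01`), the `example.local` domain, and the 192.0.2.x addresses are all hypothetical placeholders, not values from this environment. It parses hosts-file text and reports which required names have no entry.

```python
def parse_hosts(text):
    """Return a dict mapping each hostname or alias (lowercased) to its IP."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        ip, hostnames = fields[0], fields[1:]
        for name in hostnames:
            names[name.lower()] = ip
    return names


def missing_entries(hosts_text, required_names):
    """Return the required names that have no entry in the hosts text."""
    known = parse_hosts(hosts_text)
    return [name for name in required_names if name.lower() not in known]


# Hypothetical hosts file contents: two ESX hosts, no VirtualCenter entry.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
192.0.2.11  esx01.example.local esx01
192.0.2.12  esx02.example.local esx02
"""

# Hypothetical list of names every host should be able to resolve without DNS.
REQUIRED = ["esx01", "esx02", "vc01"]

print(missing_entries(SAMPLE_HOSTS, REQUIRED))
```

Running the same check against the hosts file of each ESX host and of the VirtualCenter server would show whether any of them still depends on DNS to find the others.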

                   

                   

                   

                   

                   

                  -KjB

                   

                   

                  • 6. Re: Loss of switch causing VI to shutdown?
                    weinstein5 Guru

                    It is not about the other hosts.  HA looks towards the SC gateway: if it cannot see the gateway, it will assume it is isolated and will initiate the isolation response.

                     

                     

                     

                     


                    • 7. Re: Loss of switch causing VI to shutdown?
                      khughes Virtuoso

                       

                      Well, once I got in, I was able to log in to the hosts, power on our DC/DNS server from the host VIC, and then reconnect all the hosts from our VC (right click -> Connect), which brought everything back online, almost as if they had all powered back up.  Very weird.  The VLAN is pretty easy to diagnose since there is only one VLAN.  I know, I know about splitting the networks, security, etc.  I've brought it up numerous times, but they don't want to do it or spend the money on the changes.  Anyway, I checked the hosts files again on the ESX boxes, and apparently I only manually entered the other ESX hosts, not the VirtualCenter server.  Also, there are no manual entries in the hosts file on the VirtualCenter server.

                       

                       

                       

                       

                       

                      Per my suggestion we're going to be adding DNS functionality to our physical backup server which should prevent DNS issues if we have a failure like this in the future.

                       

                       

                       

                       

                      • Kyle

                       

                       

                      • 8. Re: Loss of switch causing VI to shutdown?
                        khughes Virtuoso

                         

                        I believe he also pulled the power from our ASA firewall device, which would be our SC gateway.

                        Even if that did trigger HA and some of the rules said to power off, why would the other VMs power off?  They're set to "use cluster settings".  Where can I check to see what those default cluster settings might be?

                         

                         

                         

                         

                         

                        • Kyle

                         

                         

                        • 9. Re: Loss of switch causing VI to shutdown?
                          weinstein5 Guru

                          Those would be set in the main cluster settings.

                           

                           

                           

                           


                          • 10. Re: Loss of switch causing VI to shutdown?
                            khughes Virtuoso

                             

                            Which would be right here: Power Off VM

                            So any time the ESX host can't contact the SC gateway, is it going to trigger an isolation response?  Or was this compounded because it lost the main switch and couldn't talk to the rest of the network, including the SC gateway?

                             

                             

                             

                             

                             

                            • Kyle

                             

                             

                            • 11. Re: Loss of switch causing VI to shutdown?
                              Troy Clavell Guru
                              vExpert

                               

                              There is a 15-second default heartbeat time.  If the host(s) don't respond within that time, an isolation event will be triggered.  You can do a couple of things to help keep your VMs online during an HA isolation event: increase the heartbeat time, create a second COS, or use your VMotion NIC as the heartbeat.
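
To make the effect of the timer concrete, here is a toy model of a heartbeat timeout. It is not how HA is implemented; the class and method names are invented for illustration, and only the 15-second default comes from the post above. It shows why a longer timeout can ride through a short outage that a shorter one would treat as a failure.

```python
from dataclasses import dataclass

DEFAULT_TIMEOUT_S = 15.0  # the default heartbeat time mentioned above


@dataclass
class HeartbeatMonitor:
    """Toy model: tracks when the last heartbeat arrived (hypothetical class)."""

    timeout_s: float = DEFAULT_TIMEOUT_S
    last_heartbeat_s: float = 0.0

    def record_heartbeat(self, now_s: float) -> None:
        """Remember the time of the most recent heartbeat."""
        self.last_heartbeat_s = now_s

    def timed_out(self, now_s: float) -> bool:
        """True once the silence has lasted at least timeout_s seconds."""
        return (now_s - self.last_heartbeat_s) >= self.timeout_s


# A 20-second outage, seen by a monitor with the default timeout
# and by one with a longer (illustrative) 60-second timeout.
default = HeartbeatMonitor()
default.record_heartbeat(0.0)
patient = HeartbeatMonitor(timeout_s=60.0)
patient.record_heartbeat(0.0)

print(default.timed_out(20.0))  # True: 20 s of silence exceeds 15 s
print(patient.timed_out(20.0))  # False: 20 s of silence is under 60 s
```

The trade-off in this model is that a longer timeout also delays the reaction to a real failure by the same amount.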

                               

                               

                              • 12. Re: Loss of switch causing VI to shutdown?
                                weinstein5 Guru

                                Troy's answer is right on target.

                                 

                                 

                                 

                                 

                                • 13. Re: Loss of switch causing VI to shutdown?
                                  patrickds Expert

                                  If the HA agents cannot contact each other, and are unable to contact the SC gateway, they will trigger the isolation response.

                                  So in your case, both hosts decided they were the one being disconnected, and both triggered their isolation response.

                                  Apparently your SC connection isn't as redundant as you thought it was.

                                  If you had 2 physical NICs backing the vSwitch your SC port group was on, each connected to a different switch, this would not have happened.
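
The redundancy condition described above can be sketched as a quick check. This is only an illustration: the function name and the NIC and switch labels (`vmnic0`, `switch-a`, and so on) are hypothetical. Given which physical switch each SC uplink plugs into, it reports any switch whose loss alone would take out every uplink.

```python
def switches_that_isolate(sc_uplinks):
    """Return switches whose failure alone removes every SC uplink.

    sc_uplinks maps a NIC name to the physical switch it is cabled to.
    If all uplinks land on one switch, that switch is a single point of
    failure; if they are spread over two or more, no single switch is.
    """
    switches = set(sc_uplinks.values())
    return sorted(switches) if len(switches) == 1 else []


# Hypothetical layouts: one uplink only, versus two uplinks on two switches.
single_homed = {"vmnic0": "switch-a"}
dual_homed = {"vmnic0": "switch-a", "vmnic1": "switch-b"}

print(switches_that_isolate(single_homed))  # ['switch-a']
print(switches_that_isolate(dual_homed))    # []
```

Note that this only looks at the host-side cabling. It says nothing about whether the surviving switch still has a path to the SC gateway, which has to be checked separately.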

                                  • 14. Re: Loss of switch causing VI to shutdown?
                                    khughes Virtuoso

                                     

                                    Sorry for this really crappy MS Paint drawing, but here is kind of the gist of what it looks like...

                                     

                                     

                                    Power was pulled from Gig 0/0, so there were still two paths for production network / VMotion / SC to the other Gig switch 0/1, but there was no path for it to go anywhere else.  Obviously HA is a good thing to have enabled, as long as it is configured correctly.  Troy, as for changing the HA heartbeat timers and/or what is used for the heartbeat: where would that be done, how practical is it, and are there any concerns I should have about doing it?  Obviously if I changed it to the VMotion NICs I would never have an isolation event unless both switches went down or I lost all 4 NICs, but then again that would for sure be an isolation event.  What problems could come up from extending the heartbeat timers?
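
The "paths to the other switch but nowhere else" situation can be sketched as a reachability check on a small graph. The topology below is an assumption made up for illustration (two hosts cabled to both switches, with the gateway hanging off one switch only); the node names are hypothetical and the real layout may differ.

```python
from collections import deque


def reachable(links, start, failed=frozenset()):
    """Return the set of nodes reachable from start, skipping failed nodes."""
    graph = {}
    for a, b in links:  # links are undirected cable connections
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start in failed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:  # plain breadth-first search
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen and neighbour not in failed:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical topology: both hosts on both switches, gateway on gig0/0 only.
LINKS = [
    ("esx01", "gig0/0"), ("esx01", "gig0/1"),
    ("esx02", "gig0/0"), ("esx02", "gig0/1"),
    ("gig0/0", "gateway"),
]

after_failure = reachable(LINKS, "esx01", failed={"gig0/0"})
print("gateway" in after_failure)  # False: the only path ran through gig0/0
print("esx02" in after_failure)    # True: still reachable through gig0/1
```

In this assumed layout the host NICs are redundant, but the gateway is not, so losing the one switch that leads to it cuts every host off from it at once.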

                                     

                                     

                                     

                                     

                                    • Kyle

                                     

                                     
