16 Replies Latest reply on Nov 14, 2014 2:27 PM by brugh2

    HELP! VSAN is empty after black out!

    brugh2 Enthusiast

      we had an outage and all ESX servers rebooted. after they came back the /vmfs/volumes/vsanDatastore was empty. all our VMs were on VSAN, including vcenter.

      how can i recover my VM's?!? My entire infrastructure is down.

        • 1. Re: HELP! VSAN is empty after black out!
          brugh2 Enthusiast

          ok, i just realized a little more information on my setup might help.

          i have 3 servers with 5 disks and 2 ssd's. i can vmkping the other vsan kernelports on all servers.

           

          it seems that the vsan configuration still exists:

          ~ # vdq -i -H

          Mappings:

             DiskMapping[0]:

                     SSD:  eui.a31ab98418a7460700247110375fb631

                      MD:  naa.600605b008bcb1201b3529cdfc5b309e

                      MD:  naa.600605b008bcb1201b352b3411b960ee

             DiskMapping[2]:

                     SSD:  naa.600605b008bcb1201b352b3411ba71fe

                      MD:  naa.600605b008bcb1201b352b3411b87db4

                      MD:  naa.600605b008bcb1201b352b3411b8a008

                      MD:  naa.600605b008bcb1201b352b3411b9a715

           

          but it does look like the vsan membership isn't what it should be;

           

          ~ # esxcli vsan  cluster  get

          Cluster Information

             Enabled: true

             Current Local Time: 2014-11-13T21:14:37Z

             Local Node UUID: 539a9749-dfe0-e17f-c529-1005ca9e2fa8

             Local Node State: MASTER

             Local Node Health State: HEALTHY

             Sub-Cluster Master UUID: 539a9749-dfe0-e17f-c529-1005ca9e2fa8

             Sub-Cluster Backup UUID:

             Sub-Cluster UUID: 54985fab-d608-465f-b5eb-bf9896677959

             Sub-Cluster Membership Entry Revision: 0

             Sub-Cluster Member UUIDs: 539a9749-dfe0-e17f-c529-1005ca9e2fa8

             Sub-Cluster Membership UUID: dc136554-cf4b-4d8b-29e6-1005ca9e2fa8

           

          where it shows only one member in Sub-Cluster Member UUIDs. i think there should be 3.

           

          still no clue how to proceed. i still have a blank VSAN...
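The membership check described in this post can be scripted. The following is a minimal sketch, not an official tool: it counts the entries on the `Sub-Cluster Member UUIDs` line of `esxcli vsan cluster get` output. The function name `count_vsan_members` is hypothetical, and the assumption that multiple members are separated by commas and/or spaces is not confirmed by the single-member output above.

```shell
#!/bin/sh
# Count the UUIDs on the "Sub-Cluster Member UUIDs" line of
# `esxcli vsan cluster get` output read from stdin.
# Assumption: multiple members are separated by commas and/or spaces.
count_vsan_members() {
    awk -F': *' '
        /Sub-Cluster Member UUIDs:/ {
            n = split($2, uuids, /[ ,]+/)
            count = 0
            for (i = 1; i <= n; i++) if (uuids[i] != "") count++
            seen = 1
            print count
        }
        END { if (!seen) print 0 }
    '
}

# Sample lines taken from the output above (one member only).
sample='   Sub-Cluster Master UUID: 539a9749-dfe0-e17f-c529-1005ca9e2fa8
   Sub-Cluster Member UUIDs: 539a9749-dfe0-e17f-c529-1005ca9e2fa8
   Sub-Cluster Membership UUID: dc136554-cf4b-4d8b-29e6-1005ca9e2fa8'

expected=3   # number of hosts that should be in this cluster
found=$(printf '%s\n' "$sample" | count_vsan_members)
if [ "$found" -lt "$expected" ]; then
    echo "partitioned: $found of $expected members visible"
else
    echo "ok: $found members visible"
fi
```

On a live host the sample would be replaced by `esxcli vsan cluster get | count_vsan_members`, run on every node so the counts can be compared.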

          • 2. Re: HELP! VSAN is empty after black out!
            jetaylor Novice
            VMware Employees

            brugh2,

             

            It certainly looks like a network partition has occurred. We have one node in our cluster instead of the three you indicated should be there. If this is the case on all three nodes, we won't be able to form a quorum and get production online.

            Do you have a VMware Support Request filed? If you do, will you please PM me the SR number?

            Also, we can try a couple of things.

            1) On each host, make sure the network tagging is intact and we are associated with a vmknic:
            # esxcli vsan network list*

            2) If the network is still tagged properly (it should be), try to ping each VSAN node from each VSAN node (e.g., ping your partner machines).

            3) If the ping works, determine if we are using jumbo frames. If we are, ensure that jumbo frames are configured completely (vmknic, vswitch, physical NIC, physical switch).**

            --> If jumbo frames are in use, send a large frame ping without permitting fragmentation:

                 # vmkping -s 8500 -d <destination address>

            4) If the jumbo frames (if applicable) do NOT work, fix the MTU in the physical switch or drop your vmknics back down to 1500 MTU.

            5) If everything at the transport level checks out, we very likely have a multicast problem. Validate your IGMP groups/snooping/queriers/etc. on the physical switch to ensure that multicast is being handled properly.
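For the ping tests in steps 2 and 3, the payload size matters: with fragmentation disabled, the largest ICMP payload that fits is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header. The sketch below only prints the `vmkping` commands so they can be reviewed first. The function names, the `vmk1` interface, and the 192.0.2.x peer addresses are placeholders, not values from this thread.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
max_ping_payload() {
    echo $(( $1 - 28 ))
}

# Print (not run) one vmkping command per peer.
# -I selects the vmknic, -d forbids fragmentation, -s sets the payload size.
# The vmk1 interface name is a placeholder for the VSAN-tagged vmknic.
print_vmkping_tests() {
    mtu=$1; shift
    size=$(max_ping_payload "$mtu")
    for peer in "$@"; do
        echo "vmkping -I vmk1 -d -s $size $peer"
    done
}

print_vmkping_tests 1500 192.0.2.11 192.0.2.12   # standard frames
print_vmkping_tests 9000 192.0.2.11              # jumbo frames
```

A smaller size such as the 8500 used in step 3 also proves that frames larger than 1500 bytes get through; the computed value tests the path right up to the configured MTU.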

             

            Please let me know how things go!

             

            * The output should look something like this (from my infrastructure):

            Interface

               VmkNic Name: vmk1

               IP Protocol: IPv4

               Interface UUID: 9ebf0854-3a78-734f-b15e-90b11c2b6604

               Agent Group Multicast Address: 224.2.3.4

               Agent Group Multicast Port: 23451

               Master Group Multicast Address: 224.1.2.3

               Master Group Multicast Port: 12345

               Multicast TTL: 5

             

            ** You can use the following commands to check the jumbo frame configurations (I don't use them, so my MTUs are all 1500):

            ~ # esxcfg-vmknic -l |grep vmk1 <== I am examining vmk1 because that is the interface we got from esxcli.

            vmk1       VSAN                IPv4      172.200.200.207                         255.255.255.0   172.200.200.255 00:50:56:68:00:fb 1500    65535     true    STATIC

             

            ~ # esxcfg-vswitch -l

            [ ... ]

            Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
            vSwitch1         2352        6           128               1500    vmnic2,vmnic3

              PortGroup Name        VLAN ID  Used Ports  Uplinks
              VSAN                  0        1           vmnic2,vmnic3

            ^^ the vmknic is called "VSAN," as is the port-group name. If you use a distributed vSwitch, you will need to look for the port number instead of a portgroup name (the number will still be in the esxcfg-vmknic -l output).

             

            ~ # esxcfg-nics -l

            Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description

            [ ... ]

            vmnic2  0000:41:00.00 bnx2x       Up   10000Mbps Full   00:10:18:f1:b8:40 1500   Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet

            vmnic3  0000:41:00.01 bnx2x       Up   10000Mbps Full   00:10:18:f1:b8:42 1500   Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet

            ^^ these are the two physical NICs used as uplinks by the vmknic port group.

             

            All of the MTUs above are 1500 bytes. If you are using jumbo frames, they should all be 9000 bytes.
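The physical-NIC part of this MTU comparison can be checked mechanically. This is a rough sketch that assumes the `esxcfg-nics -l` column layout shown above, where the MTU is the eighth whitespace-separated field; the function name `check_nic_mtu` is made up for illustration.

```shell
#!/bin/sh
# Report NICs in `esxcfg-nics -l` output (stdin) whose MTU differs from the
# expected value given as $1. Returns non-zero if any NIC differs.
# Assumes MTU is the eighth whitespace-separated field, as in the output above.
check_nic_mtu() {
    awk -v want="$1" '
        $1 ~ /^vmnic[0-9]+$/ && $8 != want {
            printf "%s has MTU %s, expected %s\n", $1, $8, want
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Sample rows copied from the output above.
sample='Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description
vmnic2  0000:41:00.00 bnx2x       Up   10000Mbps Full   00:10:18:f1:b8:40 1500   Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet
vmnic3  0000:41:00.01 bnx2x       Up   10000Mbps Full   00:10:18:f1:b8:42 1500   Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet'

if printf '%s\n' "$sample" | check_nic_mtu 1500; then
    echo "all physical NICs are at MTU 1500"
fi
# With jumbo frames expected, the same rows are reported as mismatches.
printf '%s\n' "$sample" | check_nic_mtu 9000 || echo "mismatch found"
```

The vmknic and vSwitch MTUs still need to be compared by eye (or with a similar filter), since their column layouts differ.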

            • 3. Re: HELP! VSAN is empty after black out!
              brugh2 Enthusiast

              Hi jetaylor

              i'm not using jumbo frames. i tried vmkping on all hosts with the max framesize of 1472 and they all ping each other perfectly.

              as for other network settings, the switch settings are the same as before and vsan worked fine so i'm assuming that switch config is still good.

               

              i did a reboot of all hosts 15 minutes apart in the meantime. on 2 of the 3 hosts i got part of the vsan cluster back. 2 now say that the cluster consists of 2 hosts and 1 still thinks it's alone.

              that also means some of the VMs are back, just the objects that i think were owned by the one host are empty. weird thing is, if i do a 'du -shm' of the empty directories it says there's data in it but 'osfs-ls' comes up empty.

              i'm hesitant to remove the 1 host from the cluster from the command line and put it back in.

              • 4. Re: HELP! VSAN is empty after black out!
                jetaylor Novice
                VMware Employees

                If they are pinging normally, we are probably in a weird state with regard to multicast. Did you explicitly configure multicast/IGMP on your physical switch, or did it just "find a way?"

                If we are back up (even if degraded), that is a great thing, but we definitely want to get host 1 back into the mix.

                 

                I am also disinclined to do an esxcli vsan cluster leave/esxcli vsan cluster join right now. That wouldn't necessarily resolve the problem, either, since it appears to be a communication issue.

                All VSAN data movement occurs as unicast traffic but the clustering communication and quorum-maintenance are multicast. If multicast isn't working, we end up partitioned even if unicast is working (which it clearly is).
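One way to see whether the clustering multicast is actually arriving is to capture on the VSAN vmknic for the two group addresses and ports reported by `esxcli vsan network list`. The sketch below only builds the capture commands from that output and prints them. The function name is hypothetical, and the use of `tcpdump-uw` with this filter is a suggestion, not something prescribed in this thread.

```shell
#!/bin/sh
# From `esxcli vsan network list` output on stdin, print one tcpdump-uw
# command per multicast group so the traffic can be watched on the vmknic.
# The commands are printed, not run.
print_multicast_captures() {
    awk -F': *' '
        { sub(/[ \t\r]+$/, "") }          # drop trailing whitespace
        /VmkNic Name/                    { nic = $2 }
        /Agent Group Multicast Address/  { agent_addr = $2 }
        /Agent Group Multicast Port/     { agent_port = $2 }
        /Master Group Multicast Address/ { master_addr = $2 }
        /Master Group Multicast Port/    { master_port = $2 }
        END {
            printf "tcpdump-uw -i %s -n udp port %s and host %s\n", nic, agent_port, agent_addr
            printf "tcpdump-uw -i %s -n udp port %s and host %s\n", nic, master_port, master_addr
        }
    '
}

# Sample copied from the example output earlier in the thread.
sample='Interface
   VmkNic Name: vmk1
   IP Protocol: IPv4
   Agent Group Multicast Address: 224.2.3.4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group Multicast Port: 12345
   Multicast TTL: 5'

printf '%s\n' "$sample" | print_multicast_captures
```

If a host sees its own multicast packets but none from its partners while unicast pings succeed, that would be consistent with the partition described here.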

                 

                If no physical-switch configuration was done and it all "just worked" when it was spun up, then it is probably either automatically handling multicast or it is converting the traffic to broadcast.

                If we didn't do any switch config, we can try powering down (not rebooting) the node for 15 minutes or so, so that all information about the host decays out of the switch (MAC tables clear up, etc.). When we power it back on, it will hopefully repopulate everything as expected and come back up.

                 

                I am loath to reboot the physical switch, etc., since we do have our VMs back online and we don't want to risk taking everything down again right now.

                • 5. Re: HELP! VSAN is empty after black out!
                  brugh2 Enthusiast

                  here's the output from the configuration of the 1 server that thinks it's alone:

                  ~ # esxcli vsan network list

                  Interface

                     VmkNic Name: vmk4

                     IP Protocol: IPv4

                     Interface UUID: 02b9e053-e4e8-0216-9412-1005ca9e2fa8

                     Agent Group Multicast Address: 224.2.3.4

                     Agent Group Multicast Port: 23451

                     Master Group Multicast Address: 224.1.2.3

                     Master Group Multicast Port: 12345

                     Multicast TTL: 5

                   

                  ~ # esxcfg-vmknic -l |grep vmk4

                  vmk4       12                                      IPv4      192.168.1.11                            255.255.255.0   192.168.1.255   00:50:56:7d:a1:db 1500    65535     true    STATIC

                  vmk4       12                                      IPv6      fe80::250:56ff:fe7d:a1db                64                              00:50:56:7d:a1:db 1500    65535     true    STATIC, PREFERRED

                   

                  DVS Name         Num Ports   Used Ports  Configured Ports  MTU     Uplinks

                  DSwitch          5632        8           512               1500    vmnic5,vmnic4

                   

                   

                    DVPort ID           In Use      Client

                    576                 1           vmnic4

                    577                 1           vmnic5

                    16                  1           vmk0

                    0                   1           vmk1

                    12                  1           vmk4

                   

                  ~ # esxcfg-nics -l

                  Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description

                  vmnic0  0000:02:00.00 igb         Up   1000Mbps  Full   10:05:ca:9e:2f:a8 1500   Intel Corporation I350 Gigabit Network Connection

                  vmnic1  0000:02:00.01 igb         Down 0Mbps     Half   10:05:ca:9e:2f:a9 1500   Intel Corporation I350 Gigabit Network Connection

                  vmnic2  0000:02:00.02 igb         Down 0Mbps     Half   10:05:ca:9e:2f:aa 1500   Intel Corporation I350 Gigabit Network Connection

                  vmnic3  0000:02:00.03 igb         Down 0Mbps     Half   10:05:ca:9e:2f:ab 1500   Intel Corporation I350 Gigabit Network Connection

                  vmnic4  0000:88:00.00 enic        Up   10000Mbps Full   10:05:ca:a8:c7:b8 1500   Cisco Systems Inc Cisco VIC Ethernet NIC

                  vmnic5  0000:89:00.00 enic        Up   10000Mbps Full   10:05:ca:a8:c7:b9 1500   Cisco Systems Inc Cisco VIC Ethernet NIC

                  • 6. Re: HELP! VSAN is empty after black out!
                    jetaylor Novice
                    VMware Employees

                    Are you possibly using LACP or a static etherchannel to handle the traffic?

                    • 7. Re: HELP! VSAN is empty after black out!
                      brugh2 Enthusiast

                      i'm shutting down the 1 host and will boot it back up in 15min, see if that helps. the switches are configured to work with vsan btw. i have a networking guy that put the right config in (and saved it). will have him check tomorrow just to be sure.

                       

                      i wonder though, if the 1 host was just alone i think i would see some files on the vsan datastore, even if it thinks it's isolated. but the vsanDatastore directory is completely empty on that one host. the other 2 have some files but not all of them. some VM directories are missing and some VM directories that are there, are empty. hopefully the vms will come back if the cluster config gets sorted out. would hate to lose some of the VMs on there..

                      • 8. Re: HELP! VSAN is empty after black out!
                        brugh2 Enthusiast

                        not to my knowledge, the network was kept as flat as possible.

                        • 9. Re: HELP! VSAN is empty after black out!
                          brugh2 Enthusiast

                          btw, i havent been able to boot any VMs yet. when i try, i get an error and the log shows:

                           

                          VSAN: VsanIoctlCtrlNode:1746: aec35854-5321-3c01-1eea-1005ca9e2fa8: RPC to DOM returned: No connection

                           

                          sounds pretty bad. the 2 nodes are not happy yet either.

                          • 10. Re: HELP! VSAN is empty after black out!
                            jetaylor Novice
                            VMware Employees

                            The NO_CONNECTION is likely due to us not having (or having lost) quorum on a per-object basis.

                            Are the two nodes that converged still converged (e.g., do they show two members when you run the esxcli vsan cluster get)?

                             

                            In addition, can you please run the following command and paste up the output?

                            # cmmds-tool find -u aec35854-5321-3c01-1eea-1005ca9e2fa8 -f json -t DOM_OBJECT

                            • 11. Re: HELP! VSAN is empty after black out!
                              jetaylor Novice
                              VMware Employees

                              brugh2,

                               

                              Please also check your PM.

                              • 12. Re: HELP! VSAN is empty after black out!
                                brugh2 Enthusiast

                                the 2 nodes show 2 members when i type vsan cluster get and they both show the same 2 UUIDs so that looks ok.

                                as for the cmmds, there's a lot of output, no idea if that's complete or not. the owner is one of the 2 hosts.

                                 

                                ~ #  cmmds-tool find -u aec35854-5321-3c01-1eea-1005ca9e2fa8 -f json -t DOM_OBJECT

                                {

                                "entries":

                                [

                                {

                                   "uuid": "aec35854-5321-3c01-1eea-1005ca9e2fa8",

                                   "owner": "539a94ed-a020-3482-534b-1005ca9d8466",

                                   "health": "Healthy",

                                   "revision": "15",

                                   "type": "DOM_OBJECT",

                                   "flag": "2",

                                   "md5sum": "075b35a62336396933cfd22fc2a18470",

                                   "valueLen": "2616",

                                   "content": {"type": "Configuration", "attributes": {"CSN": 41, "addressSpace": 274877906944, "compositeUuid": "aec35854-5321-3c01-1eea-1005ca9e2fa8"}, "child-1": {"type": "RAID_1", "attributes": {}, "child-1": {"type": "RAID_0", "attributes": {"stripeBlockSize": 1048576}, "child-1": {"type": "Component", "attributes": {"addressSpace": 137438953472, "componentState": 6, "componentStateTS": 1415910583, "staleLsn": 0, "bytesToSync": 0, "recoveryETA": 0, "faultDomainId": "539a9749-dfe0-e17f-c529-1005ca9e2fa8"}, "componentUuid": "d1c25c54-9e6c-ffdb-4ec6-1005ca9e2fa8", "diskUuid": "5268d4b5-d67b-4962-f3a8-26542a6c0558"}, "child-2": {"type": "Component", "attributes": {"addressSpace": 137438953472, "componentState": 6, "componentStateTS": 1415910583, "staleLsn": 0, "bytesToSync": 0, "recoveryETA": 0, "faultDomainId": "539a9749-dfe0-e17f-c529-1005ca9e2fa8"}, "componentUuid": "d1c25c54-b83e-02dc-173f-1005ca9e2fa8", "diskUuid": "52c54a3f-9c2c-8c6d-f46d-610a7998993a"}}, "child-2": {"type": "RAID_0", "attributes": {"stripeBlockSize": 1048576}, "child-1": {"type": "Component", "attributes": {"addressSpace": 137438953472, "componentState": 5, "componentStateTS": 1415373588, "staleLsn": 0, "bytesToSync": 0, "recoveryETA": 0, "faultDomainId": "539a94ed-a020-3482-534b-1005ca9d8466"}, "componentUuid": "9bda5c54-1807-88d5-a8d1-1005ca9e2fa8", "diskUuid": "52afb34e-d1e6-d142-1c71-7098718817e6"}, "child-2": {"type": "Component", "attributes": {"addressSpace": 137438953472, "componentState": 6, "componentStateTS": 1415906710, "staleLsn": 9162486, "staleCsn": 40, "bytesToSync": 0, "recoveryETA": 0, "faultDomainId": "539a9751-676c-b12c-db63-1005ca9decca"}, "componentUuid": "9bda5c54-bb12-8bd5-1a8a-1005ca9e2fa8", "diskUuid": "521b3674-ae89-6355-70c7-cd5ef9d9d014"}}}, "child-2": {"type": "Witness", "attributes": {"componentState": 5, "componentStateTS": 1415915399, "staleLsn": 0, "staleCsn": 0, "isWitness": 1, "faultDomainId": 
"539a9751-676c-b12c-db63-1005ca9decca"}, "componentUuid": "9bda5c54-2be7-8ed5-edb2-1005ca9e2fa8", "diskUuid": "5242203b-d4c3-e0e0-b840-ae94aa12707c"}, "child-3": {"type": "Witness", "attributes": {"componentState": 5, "componentStateTS": 1415371419, "isWitness": 1, "faultDomainId": "539a94ed-a020-3482-534b-1005ca9d8466"}, "componentUuid": "9bda5c54-2e29-90d5-8402-1005ca9e2fa8", "diskUuid": "526674f8-c398-1ffc-b0f3-eea569b05ab2"}, "child-4": {"type": "Witness", "attributes": {"componentState": 6, "componentStateTS": 1415910583, "isWitness": 1, "faultDomainId": "539a9749-dfe0-e17f-c529-1005ca9e2fa8"}, "componentUuid": "9bda5c54-af88-91d5-1630-1005ca9e2fa8", "diskUuid": "5268d4b5-d67b-4962-f3a8-26542a6c0558"}},

                                   "errorStr": "(null)"

                                }

                                ]

                                }

                                • 13. Re: HELP! VSAN is empty after black out!
                                  jetaylor Novice
                                  VMware Employees

                                  Okay, so that output indicates that we have missing components on the following:

                                  Host: 539a9749-dfe0-e17f-c529-1005ca9e2fa8 --> The host from your initial post.

                                       Disk: 5268d4b5-d67b-4962-f3a8-26542a6c0558

                                       Disk: 52c54a3f-9c2c-8c6d-f46d-610a7998993a

                                  Host: 539a9751-676c-b12c-db63-1005ca9decca

                                       Disk: 521b3674-ae89-6355-70c7-cd5ef9d9d014

                                  (at least for this one object). We don't have a complete mirror and thus this object (among others, presumably) is offline.
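The same reading of the cmmds-tool output can be approximated with a text filter. The sketch below lists every component whose `componentState` is not 5, together with its `faultDomainId` (the host UUID) and `diskUuid`. Treating 5 as "present" and anything else as "missing" follows the interpretation in this reply and is an assumption about the numeric codes. The function name is hypothetical, the sample is a trimmed excerpt of the output above, and this is pattern matching that depends on the key order seen there, not a real JSON parser.

```shell
#!/bin/sh
# List components whose componentState is not 5, from the JSON printed by
# `cmmds-tool find ... -f json -t DOM_OBJECT` on stdin.
# Output: "<state> <host UUID (faultDomainId)> <disk UUID>" per component.
# Text matching only; relies on componentState, faultDomainId and diskUuid
# appearing in that order for each component, as in the output above.
list_missing_components() {
    tr -d '\n' | tr '{' '\n' |
    sed -n 's/.*"componentState": *\([0-9]*\).*"faultDomainId": *"\([^"]*\)".*"diskUuid": *"\([^"]*\)".*/\1 \2 \3/p' |
    awk '$1 != 5'
}

# Trimmed excerpt of the output above: one state-6 and one state-5 component.
sample='{"type": "RAID_1", "attributes": {}, "child-1": {"type": "Component", "attributes": {"componentState": 6, "faultDomainId": "539a9749-dfe0-e17f-c529-1005ca9e2fa8"}, "componentUuid": "d1c25c54-9e6c-ffdb-4ec6-1005ca9e2fa8", "diskUuid": "5268d4b5-d67b-4962-f3a8-26542a6c0558"}, "child-2": {"type": "Component", "attributes": {"componentState": 5, "faultDomainId": "539a94ed-a020-3482-534b-1005ca9d8466"}, "componentUuid": "9bda5c54-1807-88d5-a8d1-1005ca9e2fa8", "diskUuid": "52afb34e-d1e6-d142-1c71-7098718817e6"}}'

printf '%s\n' "$sample" | list_missing_components
```

Because witnesses carry the same three keys, they are listed as well when their state is not 5, which helps when judging whether an object can still reach quorum.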

                                   

                                  It is possible that during the blackout one host went down slightly later and thus has more-recent data, so other things are being held offline. We will need to look more closely to find out more.

                                  As we discussed offline, I will try to follow up with you tomorrow.

                                  • 14. Re: HELP! VSAN is empty after black out!
                                    brugh2 Enthusiast

                                    'held offline' sounds like 'not lost' which would be wonderful. http://kb.vmware.com/kb/2059091 and http://kb.vmware.com/kb/1012864 to me suggest that when i add it back to the cluster (probably remove it from its own cluster remnant first), it would start syncing things and eventually all data would be back online. i will do that as a last resort but hopefully get some troubleshooting done first to make sure that such an action doesn't actually destroy any data that may still be there.
