1 Reply Latest reply on Nov 13, 2019 9:44 PM by ryanroy837

    VSA 5.5.0 Faulted State - Help!

    GetzVM Novice

      This issue started with the failure of the UPS that was powering both a standalone storage device (separate from the VSA) and one of the two switches managing everything.

       

      This storage device provided the storage for the vCenter 5.5u2 VM managing the 2-node VSA 5.5.0 cluster. The vCenter VM was running on a server outside the VSA cluster; however, the AD and DNS server VMs were running on the VSA hosts using VSA storage, so they have gone down with the ship, and there aren't any backups of them.

       

      I was able to replicate the DNS entries on a new DNS server (using the same IP as the original DNS server) for the VSA cluster and the VDP backup appliance. The backup appliance won't work without DNS, and this is how I was able to get to the important backups and get them running again.
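      For anyone in the same boat, rebuilding the A records on a replacement Windows DNS server is quick with dnscmd. This is only a sketch -- the zone name and host names below are made up; the IPs are the ones from my environment (the ESXi hosts are 172.16.10.10/.11, everything else here is illustrative):

```shell
# Recreate forward A records on the replacement DNS server
# (zone "corp.local" and the hostnames are hypothetical examples).
dnscmd 127.0.0.1 /RecordAdd corp.local esxi-host-1 A 172.16.10.10
dnscmd 127.0.0.1 /RecordAdd corp.local esxi-host-2 A 172.16.10.11

# VDP is picky about reverse lookups too, so recreate the PTR records:
dnscmd 127.0.0.1 /RecordAdd 10.16.172.in-addr.arpa 10 PTR esxi-host-1.corp.local
dnscmd 127.0.0.1 /RecordAdd 10.16.172.in-addr.arpa 11 PTR esxi-host-2.corp.local
```

      Because the new server reuses the original DNS server's IP, none of the appliances needed their resolver settings touched.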

       

      The switch that went down was connected to half the NICs in each server; the other half were connected to the switch that stayed up (it was on a separate UPS).

      When I first arrived onsite I simply connected the storage device and the switch to the good UPS and powered everything up. vCenter started up fine, and from the VSA plugin I was able to see that both VSA appliances were in Maintenance Mode. I exited them from Maintenance Mode and watched them start syncing, and all the VMs started to appear again in the VSA cluster.

       

      However, then I did something that changed everything: I decided to reboot the router that the whole network uses. Whilst in the past I have seen the VSA appliances sync successfully without the router being powered on, as soon as I hit reboot on the router everything stopped syncing and rebuilding in the VSA, and it all went to plaid.

       

      There is also one other caveat: the router is a software router that runs in the VSA cluster on the VSA storage. The router actually runs in memory on a RAM disk and only touches the underlying storage during config changes (or reboots!), hence why it was still running in memory on the ESXi servers in the VSA cluster. At the time I thought that, since the VSA had brought the router VM back up in the inventory, I could reboot the router whilst waiting for the VSA appliances to finish syncing, given that everything the VSA appliances are doing should happen over Layer 2 and not require the router.

       

      I think there could be two possible reasons for the failure during the reboot of the router:
      1 – the ESXi networking on the hosts in the VSA cluster was messed up during the vCenter crash, and the VSA replication was somehow relying on the router to function.
      2 – the act of rebooting the software router affected the VSA datastores during replication and corrupted them.

       

      So that's where we are now. The VSA Manager shows the VSA cluster as up, but all the datastores and appliances are offline: the VSA-0 replica datastore shows as Degraded, the VSA-1 replica datastore is in Maintenance Mode, and everything else appears as Offline.

       

      Below you can see the output of the state of the cluster. What I also found when looking through the VSA Manager logs is this:
      [VCUserSession] [tomcat-http--18] ERROR – Unable to get Host's (172.16.10.11) Subnet/Gateway.
      java.lang.Exception: Unable to find VMkernel port group named "Management Network" on host 172.16.10.11. Please ensure that the port group exists.

      [VCUserSession] [tomcat-http--18] ERROR – Unable to get Host's (172.16.10.10) Subnet/Gateway.
      java.lang.Exception: Unable to find VMkernel port group named "Management Network" on host 172.16.10.10. Please ensure that the port group exists.
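      To see from the host side what the VSA Manager was failing to find, I checked the standard vSwitch config on each host. These are stock esxcli commands (run over SSH on each ESXi host), nothing VSA-specific:

```shell
# List standard vSwitches and the port groups attached to them --
# "Management Network", "VSA-Front End" and "VSA-Back End" should all appear.
esxcli network vswitch standard list
esxcli network vswitch standard portgroup list

# Confirm the vmkernel interface still has the expected IP/netmask.
esxcli network ip interface ipv4 get
```

      In my case the port groups really were missing from the hosts' config after the crash, which matches the errors above.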

       

      and also:

       

      2019-09-09 07:51:30,264 2285 [VCUtils] [tomcat-http--18] ERROR – Unable to find portgroup: VSA-Front End. Returning -1 for VLAN ID.
      2019-09-09 07:51:30,280 2285 [VCUtils] [tomcat-http--18] ERROR – Unable to find portgroup: VSA-Back End. Returning -1 for VLAN ID.
      2019-09-09 07:51:30,509 444 [VCUserSession] [tomcat-http--18] ERROR – Failed to convert host data

       

      Assuming that Reason 1 was the cause of the failure after the router reboot, and that the networking on the ESXi hosts was messed up after the vCenter crash, I moved the Management Network vmkernel port and all the VSA port groups to a new standard switch and connected the VSA appliances to this new switch. This fixed the Subnet/Gateway and port group errors above, but sadly the VSA appliances didn't spring back to life.
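      Roughly what the rebuild looked like, for the record -- a sketch only; the vSwitch name and uplink below are assumptions, so substitute your own, and any VLAN IDs the original port groups carried would need to be restored too:

```shell
# Create a fresh standard vSwitch and give it a physical uplink
# (vSwitch1 / vmnic1 are placeholder names -- adjust per host).
esxcli network vswitch standard add -v vSwitch1
esxcli network vswitch standard uplink add -v vSwitch1 -u vmnic1

# Recreate the port groups the VSA Manager expects to find.
esxcli network vswitch standard portgroup add -v vSwitch1 -p "Management Network"
esxcli network vswitch standard portgroup add -v vSwitch1 -p "VSA-Front End"
esxcli network vswitch standard portgroup add -v vSwitch1 -p "VSA-Back End"

# If a port group was VLAN-tagged originally, set it again, e.g.:
# esxcli network vswitch standard portgroup set -p "VSA-Back End" --vlan-id <id>
```

      The VM network adapters for the VSA appliances then need to be repointed at the new port groups.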

       

      I have found an example of how to access the underlying data in the appliances here http://notes.doodzzz.net/2014/09/02/vmware-vsa-export-vms-from-broken-appliances/
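      As I understand that approach, the end state is getting the appliance's NFS export mounted directly on a host again so the VMs can be copied off. Once an export is actually being served, mounting it manually would look something like this (the access IP and export path are taken from my cluster dump below; the datastore name is my own invention):

```shell
# Mount the VSA-0 NFS export as a plain datastore on an ESXi host.
esxcli storage nfs add -H 172.16.10.43 \
    -s /exports/1972c234-b489-4359-84e9-e027f15108a2 -v VSA0-recovered

# Verify the mount.
esxcli storage nfs list
```

      That obviously won't work while the storage entities are OFFLINE as shown below, which is why I'd rather get the appliances themselves healthy first.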

      However, before I go destroying the whole VSA cluster, are there any suggestions on how I might kick-start these appliances into action again?

       

      Cheers.

       

       

      Storage Cluster ID = 2ad6bfee-2ee8-426d-b4c0-d8f6279e77e0
      Name = vStorage Cluster
      Maintenance mode = false
      Management interface = 172.16.10.40/24
      Master = 2b105af9-1997-44e9-a090-c9dba81c7f5e
      VSA Cluster Service address = 172.16.10.41
      VSA Cluster Service state = true
      Physical capacity = 7765229568 KB (7405.50 GB)
      Storage capacity = 3882614784 KB (3702.75 GB)

          1. Members:
            Member 0:
            SVA ID = df3494a3-e098-451b-b1a1-0dfd4239a26d
            Name = localhost
            Maintenance mode = false
            Domain name = localdom
            Member ID = 5f513c27-7b5f-43d4-9789-a512b4d1412d
            Storage Cluster ID = 2ad6bfee-2ee8-426d-b4c0-d8f6279e77e0
            Primary member = true
            State = ONLINE
            Management interface = 172.16.10.42/24
            Internal interface = 192.168.75.1/24
            No DNS servers
            Gateway = 172.16.10.1
            Total storage = 3882614784KB
            Free storage = 0KB
            Used storage = 3882614784KB
            Storage pool 0:
            ID = 00000000-0000-0000-0000-000000000000
            Total storage = 3882614784
            Free storage = 0
            Used storage = 3882614784
            No exported storage entities
            Member 1:
            SVA ID = 2b105af9-1997-44e9-a090-c9dba81c7f5e
            Name = localhost
            Maintenance mode = false
            Domain name = localdom
            Member ID = 35da34c2-e3e8-4d55-a8ce-4cb4fcd93a4e
            Storage Cluster ID = 2ad6bfee-2ee8-426d-b4c0-d8f6279e77e0
            Primary member = true
            State = ONLINE
            Management interface = 172.16.10.44/24
            Internal interface = 192.168.75.2/24
            No DNS servers
            Gateway = 172.16.10.1
            Total storage = 3882614784KB
            Free storage = 0KB
            Used storage = 3882614784KB
            Storage pool 0:
            ID = 00000000-0000-0000-0000-000000000000
            Total storage = 3882614784
            Free storage = 0
            Used storage = 3882614784
            Export storage entities:
            becfc35d-06db-445c-b43d-2810c1a4a355
            8a87c4b3-cfce-4cc9-a791-90632370e53a
          2. Storage Entities:
            Storage entity 0:
            NFSv3StorageEntity:
            Id = becfc35d-06db-445c-b43d-2810c1a4a355
            Name = NFSv3StorageEntityBlc-becfc35d-06db-445c-b43d-2810c1a4a355
            State = OFFLINE
            AccessInterface = 172.16.10.43/24
            Export Member SVA Id = 2b105af9-1997-44e9-a090-c9dba81c7f5e
            # Exports = 1
            Export[0]:
            Name = NFSv3Export-1972c234-b489-4359-84e9-e027f15108a2
            Id = 1972c234-b489-4359-84e9-e027f15108a2
            StorageEntity Id = becfc35d-06db-445c-b43d-2810c1a4a355
            Export Path = /exports/1972c234-b489-4359-84e9-e027f15108a2
            Acl[] = [172.16.10.11 172.16.10.47 172.16.10.10 172.16.10.46 172.16.10.30 172.16.10.31 172.16.10.32 172.16.10.33 172.16.10.34 172.16.10.35 172.16.10.36 172.16.10.37 ]
            Volume:
            Name = Volume-ad89b545-3ec3-488c-a93e-991b97f5f73f
            Id = ad89b545-3ec3-488c-a93e-991b97f5f73f
            Size = 1941307392
            Primary Owner = df3494a3-e098-451b-b1a1-0dfd4239a26d
            Current Owner = 2b105af9-1997-44e9-a090-c9dba81c7f5e
            Export Id = 1972c234-b489-4359-84e9-e027f15108a2
            State = OFFLINE
            Type = RAID1
            # Disks = 2
            Sync Percent = 100.00 %
            Sync Time-to-Complete = 0 sec.
            Sync Speed = 0 KB/s
            Disk[0]
            Name = Disk-4abe9360-c4ae-4fe5-a09a-41ba07121972
            Id = 4abe9360-c4ae-4fe5-a09a-41ba07121972
            Owner Id = df3494a3-e098-451b-b1a1-0dfd4239a26d
            State = AVAILABLE
            Synced = false
            Disk[1]
            Name = Disk-13e17bb7-021f-45d8-8ac2-69fd7b87b5f1
            Id = 13e17bb7-021f-45d8-8ac2-69fd7b87b5f1
            Owner Id = 2b105af9-1997-44e9-a090-c9dba81c7f5e
            State = FAULTED
            Synced = true
            Storage entity 1:
            NFSv3StorageEntity:
            Id = 8a87c4b3-cfce-4cc9-a791-90632370e53a
            Name = NFSv3StorageEntityBlc-8a87c4b3-cfce-4cc9-a791-90632370e53a
            State = OFFLINE
            AccessInterface = 172.16.10.45/24
            Export Member SVA Id = 2b105af9-1997-44e9-a090-c9dba81c7f5e
            # Exports = 1
            Export[0]:
            Name = NFSv3Export-14e0ff66-ebe6-491a-a3fb-2019ac7f9dec
            Id = 14e0ff66-ebe6-491a-a3fb-2019ac7f9dec
            StorageEntity Id = 8a87c4b3-cfce-4cc9-a791-90632370e53a
            Export Path = /exports/14e0ff66-ebe6-491a-a3fb-2019ac7f9dec
            Acl[] = [172.16.10.11 172.16.10.47 172.16.10.10 172.16.10.46 172.16.10.30 172.16.10.31 172.16.10.32 172.16.10.33 172.16.10.34 172.16.10.35 172.16.10.36 172.16.10.37 ]
            Volume:
            Name = Volume-689af189-6d6c-4aef-8ec1-9c6f4e53df5c
            Id = 689af189-6d6c-4aef-8ec1-9c6f4e53df5c
            Size = 1941307392
            Primary Owner = 2b105af9-1997-44e9-a090-c9dba81c7f5e
            Current Owner = 2b105af9-1997-44e9-a090-c9dba81c7f5e
            Export Id = 14e0ff66-ebe6-491a-a3fb-2019ac7f9dec
            State = OFFLINE
            Type = RAID1
            # Disks = 2
            Sync Percent = 100.00 %
            Sync Time-to-Complete = 0 sec.
            Sync Speed = 0 KB/s
            Disk[0]
            Name = Disk-faf578fa-6be6-4db5-9a3b-c6f2ebbf1de3
            Id = faf578fa-6be6-4db5-9a3b-c6f2ebbf1de3
            Owner Id = 2b105af9-1997-44e9-a090-c9dba81c7f5e
            State = FAULTED
            Synced = true
            Disk[1]
            Name = Disk-2fd63de5-52f8-4abf-9ce9-37ef61091fa9
            Id = 2fd63de5-52f8-4abf-9ce9-37ef61091fa9
            Owner Id = df3494a3-e098-451b-b1a1-0dfd4239a26d
            State = AVAILABLE
            Synced = false