VMware Cloud Community
diegoarroyo
Contributor

vSphere 7 Disconnected from Host: Reason License expired, although still valid on Host. Cannot recover data from vSAN

Good Morning,

I started a vSphere 7 evaluation with vSAN. As it is a lab, I usually shut down all components in order (first the VMs, then vCenter, then the hosts).

Today I cannot start some VMs, and I see that one host is disconnected. (reason on vcenter: License expired)

I checked on Host: You are currently using ESXi in evaluation mode. This license will expire in 60 days.

Installation was about two weeks ago; all servers were installed within the same hour, and vCenter was deployed as well. vSAN was configured, some machines were created and deleted, and all vSAN parameters and status seemed OK.

Time was set from the beginning with NTP (all components are synchronized with a pair of NTP servers).

On the problem host, NTP "start with host" was not working. This server seems to have a battery problem, and all servers were disconnected from the PDU yesterday due to maintenance in the lab (they were powered off).

After correcting the time, setting NTP again, and restarting the server, it still does not connect to vCenter.

I face two problems:

* Cannot connect that server to the cluster after the clock problem. -> I could reinstall that server, but the real problem is:

* Cannot recover the virtual machines' storage on that server, although they are replicated on vSAN in a "RAID 1" configuration and cluster health was green before shutting down all machines

Policy:

General

  Name     vSAN Default Storage Policy

  Description     Storage policy used as default for vSAN datastores

Rule-set 1: VSAN

  Placement

    Storage Type     VSAN

    Site disaster tolerance     None - standard cluster

    Failures to tolerate     1 failure - RAID-1 (Mirroring)

    Number of disk stripes per object     1

    IOPS limit for object     0

    Object space reservation     Thick provisioning

    Flash read cache reservation     0%

    Disable object checksum     No

    Force provisioning     No
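As a rough illustration of what the policy above implies, here is a minimal sketch. It assumes the commonly documented vSAN RAID-1 rule that tolerating n failures takes n+1 full replicas and at least 2n+1 hosts; witness components are ignored, and the function name and the 40 GB disk size are hypothetical.

```python
def raid1_mirroring_footprint(ftt: int, vmdk_gb: float) -> dict:
    """Estimate replicas, minimum hosts and raw capacity for RAID-1 mirroring.

    Assumption: n failures to tolerate -> n + 1 full replicas, 2n + 1 hosts.
    Witness components are small and are not counted here.
    """
    if ftt < 0:
        raise ValueError("failures to tolerate must be >= 0")
    replicas = ftt + 1
    return {
        "replicas": replicas,
        "min_hosts": 2 * ftt + 1,
        "raw_gb": vmdk_gb * replicas,
    }


if __name__ == "__main__":
    # Hypothetical 40 GB disk under "1 failure - RAID-1 (Mirroring)".
    print(raid1_mirroring_footprint(ftt=1, vmdk_gb=40))
```

Under these assumptions a single host being unavailable should still leave one full replica of each object on the remaining hosts.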

Any clue on how to recover the data?

The data is not important (they were dummy virtual machines), but I had planned to test vSAN resilience, and after the cascade of problems caused by one tiny hardware fault it seems that it is not working as expected.

Thanks and Best Regards,

7 Replies
TheBobkin
Champion

Hello diegoarroyo,

Welcome to Communities.

"(reason on vcenter: License expired)"

If you wish to use vCenter then acquire a valid license - if this is a homelab you can get NFR vSphere licenses from VMUG Advantage membership which isn't too costly.

"* Cannot connect that server to the cluster after the clock problem. -> I could reinstall that server, but the real problem is:"

What do you mean by this? Are one/some hosts partitioned from the vSAN cluster? This can be validated simply via SSH to the hosts and checking the cluster membership count from:

# esxcli vsan cluster get

"* Cannot recover the virtual machines' storage on that server, although they are replicated on vSAN in a "RAID 1" configuration and cluster health was green before shutting down all machines"

Please indicate what you mean here - it is expected to not be able to power on VMs via vSphere client if the hosts are disconnected - validate that you can power on the VMs via the Host clients or via SSH (e.g. vim-cmd vmsvc/power.on <VM-ID>).

"The data is not important (they were dummy virtual machines), but I had planned to test vSAN resilience, and after the cascade of problems caused by one tiny hardware fault it seems that it is not working as expected"

The data is likely fine - vSAN doesn't need vCenter available to maintain the integrity of the data.

Bob

diegoarroyo
Contributor

Thanks Bob,

As I posted, I started an evaluation of the product, so it is a registered 60-day evaluation license with only a few hours of use. The reason for this evaluation is to check whether it works for our needs before buying it.

I am going to try to explain the problem better:

1.- Battery failure on one host.

2.- That host was powered off and disconnected from the power supply.

3.- Host is started again.

4.- Time is not synchronized although it is configured (I have checked the configuration again on each of the servers and all had the ntp service startup policy "Start and Stop with host", but after restarting the servers that service never comes up) -> Bug?

5.- As the host has the wrong time, when it contacts vCenter, vCenter decides that the license is no longer valid and kicks it out of the cluster. (Disconnected from Host: Reason License expired)

-> Cannot connect it again manually ("No licenses available" message). The license on the host is still valid (evaluation, 60 days remain) and the vCenter evaluation license is also still valid.

6.- Data of the virtual machines assigned to run in the resource pool of that host is lost.

The problem is how to recover from step 6. I thought that with RAID 1 redundancy all data was stored twice in two parts of the vSAN cluster, but I could not find a way to recover it.

As a side note, if I remove the failed host from the cluster and add it again, it is again valid to work in the cluster, but the data is still lost.

Best Regards,

Diego

diegoarroyo
Contributor

I forgot,

Before adding the host to the cluster again, the virtual machines assigned to the resource pool of that host were marked as invalid.

On the host itself they were shown with UIDs instead of names, and on vCenter with names, but I could not rebalance them, replicate them or anything.

After deleting the host and connecting it to the cluster again, the virtual machines assigned to that resource pool were still invalid. All other virtual machines that were not in that resource pool were and are fine.

Best Regards,

Diego

TheBobkin
Champion

Hello Diego,

"As a side note, if I remove the failed host from the cluster and add it again, it is again valid to work in the cluster, but the data is still lost."

Did you do anything with the Disk-Groups? Leaving a cluster doesn't remove these and thus doesn't affect the state of the data (other than it is obviously not available until it rejoins the cluster).

"virtual machines assigned to the resource pool of that host were marked as invalid.

On the host itself they were shown with UIDs instead of names"

Yes, likely because the namespace Objects were inaccessible - this is expected behaviour - once they become available again these should clean up automatically to their friendly-names but can sometimes need manual intervention via RVC (vsan.fix_renamed_vms).

I want to better understand the current state of the data. If the cluster is fully formed, please share the output of these commands run on any host in the cluster:

# cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c

# cmmds-tool find -t NODE_DECOM_STATE

Also, if you can, copy the current layout of a few inaccessible objects (if any) from the data generated via:

# esxcli vsan debug object list > /tmp/objout

diegoarroyo
Contributor

Thanks Bob,

I tried to reproduce it, but this time it did not fail. I paste the steps and output here.

Next week I will reinstall all the environment and try to reproduce it again.

Output in all nodes is the same. First "clean" output:

[root@vm-1:~] cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6| grep 'uuid\|content'  | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c

     72 state\": 7

[root@vm-1:~] cmmds-tool find -t NODE_DECOM_STATE

owner=5eaac3e5-9504-3788-e531-7845c4f9d1f1(Health: Healthy) uuid=5eaac3e5-9504-3788-e531-7845c4f9d1f1 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

owner=5eaac541-9d77-c3a4-43ae-7845c4f9d245(Health: Healthy) uuid=5eaac541-9d77-c3a4-43ae-7845c4f9d245 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

owner=5eaac5fa-93c4-90b6-991f-7845c4f9cdbf(Health: Healthy) uuid=5eaac5fa-93c4-90b6-991f-7845c4f9cdbf type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

owner=4d1e9435-49a8-aadd-8564-7845c4f9ce31(Health: Healthy) uuid=4d1e9435-49a8-aadd-8564-7845c4f9ce31 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)
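For longer listings, the per-node health can be pulled out of this output with a few lines of Python. This is only a sketch: it assumes each entry starts with `owner=<uuid>(Health: <state>)` exactly as in the lines above, and the function names are made up for illustration.

```python
import re

# Matches the "owner=<uuid>(Health: <state>)" prefix of each NODE_DECOM_STATE entry.
OWNER_RE = re.compile(r"owner=([0-9a-f-]+)\(Health: (\w+)\)")


def node_health(output: str) -> dict:
    """Map each owner UUID to the health string printed next to it."""
    return {m.group(1): m.group(2) for m in OWNER_RE.finditer(output)}


def unhealthy_nodes(output: str) -> list:
    """Return the UUIDs whose health string is anything other than 'Healthy'."""
    return sorted(u for u, h in node_health(output).items() if h != "Healthy")


# Two entries copied from the output above.
SAMPLE = '''
owner=5eaac3e5-9504-3788-e531-7845c4f9d1f1(Health: Healthy) uuid=5eaac3e5-9504-3788-e531-7845c4f9d1f1 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)
owner=5eaac541-9d77-c3a4-43ae-7845c4f9d245(Health: Healthy) uuid=5eaac541-9d77-c3a4-43ae-7845c4f9d245 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)
'''

if __name__ == "__main__":
    print(node_health(SAMPLE))
    print(unhealthy_nodes(SAMPLE))
```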

The node that has the battery clock problem is vm-3. Before shutting down and unplugging the power supplies it had 4 virtual machines: test-3, ns-11, ns-15 and ns-26.

Reproducing the problem: stopping everything again, unplugging the power supply and powering it on.

On all nodes the ntp service did not start (it was started before shutdown and is configured to start with the host).

vm-3, as it has the battery clock problem, shows the date as 1st January 2011.

All virtual machines are synchronized from vSAN, and I could start any of them from each host (vCenter is still offline). Output of the command is the same

[root@vm-1:~] cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6| grep 'uuid\|content'  | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c

     54 state\": 7

[root@vm-3:~] cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6| grep 'uuid\|content'  | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c

     54 state\": 7

[root@vm-3:~] cmmds-tool find -t NODE_DECOM_STATE

owner=5eaac3e5-9504-3788-e531-7845c4f9d1f1(Health: Healthy) uuid=5eaac3e5-9504-3788-e531-7845c4f9d1f1 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

owner=5eaac541-9d77-c3a4-43ae-7845c4f9d245(Health: Healthy) uuid=5eaac541-9d77-c3a4-43ae-7845c4f9d245 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

owner=5eaac5fa-93c4-90b6-991f-7845c4f9cdbf(Health: Healthy) uuid=5eaac5fa-93c4-90b6-991f-7845c4f9cdbf type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

owner=4d1e9435-49a8-aadd-8564-7845c4f9ce31(Health: Healthy) uuid=4d1e9435-49a8-aadd-8564-7845c4f9ce31 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

[root@vm-1:~] date

Fri May 15 18:43:10 UTC 2020

[root@vm-2:~] date

Fri May 15 18:43:13 UTC 2020

[root@vm-3:~] date

Sat Jan  1 00:16:26 UTC 2011

[root@vm-4:~] date

Fri May 15 18:43:10 UTC 2020
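The drift is obvious by eye here, but with more hosts it can help to compare the readings mechanically. The following is a minimal sketch that parses the `date` output shown above and flags hosts far from the median reading; the 60-second tolerance and the function names are assumptions for illustration, not values taken from vSAN or vCenter.

```python
from datetime import datetime, timedelta

# Layout of the ESXi `date` output shown above, e.g. "Fri May 15 18:43:10 UTC 2020".
DATE_FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def parse_esxi_date(text: str) -> datetime:
    """Parse one `date` line; runs of spaces (as in 'Jan  1') are collapsed first."""
    return datetime.strptime(" ".join(text.split()), DATE_FORMAT)


def hosts_out_of_sync(readings: dict, tolerance: timedelta = timedelta(seconds=60)) -> list:
    """Return hosts whose clock differs from the median reading by more than tolerance."""
    parsed = {host: parse_esxi_date(text) for host, text in readings.items()}
    ordered = sorted(parsed.values())
    reference = ordered[len(ordered) // 2]
    return sorted(h for h, ts in parsed.items() if abs(ts - reference) > tolerance)


# Readings copied from the session above (taken by hand, a few seconds apart).
READINGS = {
    "vm-1": "Fri May 15 18:43:10 UTC 2020",
    "vm-2": "Fri May 15 18:43:13 UTC 2020",
    "vm-3": "Sat Jan  1 00:16:26 UTC 2011",
    "vm-4": "Fri May 15 18:43:10 UTC 2020",
}

if __name__ == "__main__":
    print(hosts_out_of_sync(READINGS))
```

The median is used as the reference so that a single host with a badly wrong clock cannot pull the reference towards itself.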

Now I start the vCenter virtual machine (it is on node vm-4).

On the console everything seems OK now (the problem was not reproduced).

I got a message: vSAN Cluster: vSAN health alarm 'Time is not synchronized across hosts and VC'. That is what I would have expected the first time it happened.

Well, not really expected, as I configured NTP on all nodes and vCenter, but NTP did not start automatically on boot on this vSphere 7 build.

Best Regards,

Diego

TheBobkin
Champion

Hello Diego,

All Objects have Config-status 7 which means they are healthy and accessible - thus I don't see the issue here (other than NTP/licensing etc.).
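If the summary ever shows more than one line, a small helper can total the counts per state. This sketch only assumes the `uniq -c` layout pasted earlier in the thread (`<count> state\": <value>`) and treats 7 as healthy, as stated above; it does not try to interpret any other state value, and the function names are hypothetical.

```python
import re

HEALTHY_STATE = 7  # the only value whose meaning is given in this thread

# One summary line looks like:      54 state\": 7
LINE_RE = re.compile(r'^\s*(\d+)\s+state\\":\s*(\d+)\s*$')


def summarize_config_status(output: str) -> dict:
    """Return {state_value: object_count} from the `sort | uniq -c` summary."""
    counts = {}
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match:
            count, state = int(match.group(1)), int(match.group(2))
            counts[state] = counts.get(state, 0) + count
    return counts


def all_objects_healthy(output: str) -> bool:
    """True only if at least one line parsed and every object reports state 7."""
    counts = summarize_config_status(output)
    return bool(counts) and set(counts) == {HEALTHY_STATE}


# Line copied from the output pasted earlier in the thread.
SAMPLE = r'''     54 state\": 7'''

if __name__ == "__main__":
    print(summarize_config_status(SAMPLE))
    print(all_objects_healthy(SAMPLE))
```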

If you still have VMs named with directory paths then this is easily resolved:

https://www.virten.net/2017/07/vsan-6-6-rvc-guide-part-6-troubleshooting/#vsan-fix_renamed_vms

Bob

diegoarroyo
Contributor

Hello Bob,

I think I was too quick to say that it was not reproduced:

Today vm-3 is disconnected, with an expired host license message in vSphere.

Also these two additional error messages:

vSAN health alarm 'Hosts disconnected from VC'

vSAN health alarm 'vSphere cluster members do not match vSAN cluster members'

Time was corrected yesterday, and it is still OK:

[root@vm-1:~] date

Sat May 16 09:22:04 UTC 2020

[root@vm-2:~] date

Sat May 16 09:22:04 UTC 2020

[root@vm-3:~] date

Sat May 16 09:22:03 UTC 2020

[root@vm-4:~] date

Sat May 16 09:22:04 UTC 2020

The health output on the nodes is also OK:

[root@vm-3:~] cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6| grep 'uuid\|content'  | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c

     55 state\": 7

[root@vm-3:~] cmmds-tool find -t NODE_DECOM_STATE

owner=5eaac3e5-9504-3788-e531-7845c4f9d1f1(Health: Healthy) uuid=5eaac3e5-9504-3788-e531-7845c4f9d1f1 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

owner=5eaac541-9d77-c3a4-43ae-7845c4f9d245(Health: Healthy) uuid=5eaac541-9d77-c3a4-43ae-7845c4f9d245 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

owner=5eaac5fa-93c4-90b6-991f-7845c4f9cdbf(Health: Healthy) uuid=5eaac5fa-93c4-90b6-991f-7845c4f9cdbf type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

owner=4d1e9435-49a8-aadd-8564-7845c4f9ce31(Health: Healthy) uuid=4d1e9435-49a8-aadd-8564-7845c4f9ce31 type=NODE_DECOM_STATE rev=0 minHostVer=0  [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0 "" i0 i0 l0 l0)], errorStr=(null)

Virtual machines ns-11, ns-15, ns-26 and test-3 are disconnected (the ones that were last running on vm-3).

vSAN Skyline health shows

in Network: Host disconnected from VC (vm-3)

in Cluster: 'vSphere cluster members match vSAN cluster members' shows an error, because the host is not in the vSphere cluster but it is in the vSAN cluster

in Virtual Objects: "There are connectivity issues in this cluster. One or more hosts are unable to communicate with the vSAN datastore. Data below does not reflect the real state of the system."

-> Placement and Availability status shows Healthy: 54, with no other problem, and it did not change after a retest.

If I log in to vm-3, I can see the data of all 4 VMs (the power-on button can also be pressed, but I have not powered on any machine from the host).

From vCenter, if I right-click the host and select Connection -> Connect

I get the normal warning: Reconnecting a host will override any resource management changes that were made directly on the host while it was disconnected. To keep these changes, remove host from the inventory and add it again

And an error: Cannot complete the license assignment operation.

A dialog box opens, like the one shown when you add a host to vCenter. It asks for confirmation about the host and for the host credentials, and after finishing I get the same error:

Cannot complete the license assignment operations.

Details of the error:

Error stack:

"vSphere vMotion"

"vSphere FT"

"vSphere DRS"

"vCenter agent for VMware host"

The Evaluation Mode license on "Host" 192.168.255.25  can not be changed to . To downgrade the license, first disable the following features.

At this moment, what should be the recovery process?

-> I cannot migrate the virtual machines from vCenter, nor clone them, as they are disconnected. Edit settings is also disabled.

Last time, at this point, I rebooted the host to see whether it would be recognized by vCenter once the time was synchronized, but it was not, and after that I also rebooted vCenter.

At that point the virtual machines were in a worse state than they are now.

I think that if I follow the same steps I will lose access to the data again, and the same if I remove vm-3 from the inventory and add it again (that step worked last time to make vm-3 usable in the cluster without needing to reinstall anything).

Best Regards,

Diego Arroyo
