ssSFrankSss
Enthusiast

I have lost my vSAN. Could you help?


Is there somebody who can help me retrieve my vSAN setup?

That day was sooo bad; I was making one bad decision after another. I cannot believe myself. Literally, I cannot believe myself :(

Let me describe my setup briefly: 3 Dell R720xd nodes with 3 SSDs each; vCenter runs on vSAN.

1) I lost my UPS, an APC. I sent it off to be repaired and meanwhile used an old one with rather old batteries. The repaired unit was never returned to me, so I kept the old one.

- I never changed it. I said, OK, it holds for a few minutes, we are fine.

2) I lost one SSD; the system kept working fine.

- I said, OK, it works; I am too busy, let's do something else.

Now things go reallyyyy bad! That day arrived: due to strong winds we had power outages during the night. I arrived at the site in the morning and could not locate any VMs. I said, OK, maybe I just have to power them on. I logged in to ESXi; it seemed vCenter was not working. I tried to restart the server from ESXi, but the task kept loading forever, and I could not open any VM or do anything.

so...

3) I restarted all three servers simultaneously. (It is bad, I know; don't ask me why I did it.) After that restart, every VM appeared with Status: "Invalid", but I could still see the vSAN. I could not register any VM at that point, since when I opened the datastore browser I could not see any files. At that time I could still see the capacity of the vSAN.

4) I said, OK, let me shut them down one by one, open them up and get the dust out - over the years they had collected a lot - and then see if anything gets fixed! I opened and cleaned them carefully, but now one of the three servers would not power on at all! LOL, it went from bad to worse. At this point vSAN appeared as 0 from the other two servers!

5) I was so sad and mad, and without a clear mind I said, OK, let's restore from an older backup and rebuild everything from scratch. So, of the 2 running servers, I removed one from the vSAN cluster and tried to erase one of its disks in order to start over. At that point I remembered that I have some really important files that I have not backed up.

Long story short: I have one server that cannot be powered on (I have tried swapping power supplies, no luck), and one of the two working servers has one erased SSD (out of 3 SSDs) and has also been removed from the vSAN cluster.

Since I have some really important files I want to retrieve, I would appreciate it if somebody could help me with this situation.

I think even God will find this difficult. Or maybe a real vSAN and server expert could help me. I am really sad about my bad decisions. I do not have any support at this time, so I would be glad to discuss payment in PM if somebody can help me!

Thank you in advance,

Frank


22 Replies
depping
Leadership

Did you try contacting VMware Support? If that is possible at all - which sounds unlikely - they would be your best option!

depping
Leadership

Okay, I have a crazy idea, which may work...

You have 2 working servers, or at least 2 servers that can be powered on. Considering the host with the failed SSD was the first failure in your environment, the below may just work. I provide no guarantee, and can't be held liable in ANY SHAPE OR FORM!

I would try the following:

  • Physically mark my SSDs
  • Remove ALL the SSDs from the host on which an SSD failed, and don't touch them again
  • Remove the SSDs from the host which can no longer be powered on
  • Place the SSDs from the host that could not be powered on into the host where the SSD failed, in the same order
  • Power on the host with the "new" SSDs
  • Wait and hope the objects become accessible again.
ssSFrankSss
Enthusiast

No, as I do not have any support subscription... That's why I am asking whether a vSAN expert could help me, with payment of course!

depping
Leadership

See the above, it may just work. That is what I would do to be honest.

ssSFrankSss
Enthusiast

This is a crazy idea indeed, haha. I will try it tomorrow, although it is highly unlikely to work. In any case, if it works, I will ask for your PayPal account to give you a donation :)

depping
Leadership

Nah, I work for VMware, I can't take any money. But theoretically it has a chance of success, as the host that died last should have the latest data, just like the host which is 100% healthy. So if you combine the good disks with the good server, you should be able to power on the VMs again.

TheBobkin
VMware Employee

Hello Frank,

I was going to suggest, as Duncan said, removing the disks from the host that you cannot power on and putting them in a server that you can power on but that doesn't have functional/current data.

However, before you do this - can you add the host you removed from the cluster back (either via the UI or via the CLI if on an older build) and share the output of:

# esxcli vsan debug object list

If this is an older build which doesn't have this command then similar data can be generated using:

# python /usr/lib/vmware/vsan/bin/vsan-health-status.pyc > /tmp/healthOut.txt

If it cannot tell the state of the data due to everything being inaccessible then the CMMDS output should tell us what the state of the components are:

# cmmds-tool find -t DOM_OBJECT -f json > /tmp/DOMOut.txt

# cmmds-tool find -t HOSTNAME -f json > /tmp/HOSTNAMEOut.txt

# cmmds-tool find -t NODE_DECOM_STATE -f json > /tmp/DECOMOut.txt
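If the dump is large, a quick grep tally of component states can give a first impression. This is only a sketch: it assumes the DOM output was saved to /tmp/DOMOut.txt as above and that the JSON contains a "componentState" field with the usual numeric codes (5 = ACTIVE, 6 = ABSENT) - verify the field name and codes on your build before trusting the counts. The sample lines here are fabricated for illustration:

```shell
# Sketch only: tally vSAN component states in a CMMDS DOM dump.
# Assumption: the dump has "componentState" fields where 5 = ACTIVE and
# 6 = ABSENT (verify on your build). The sample below is fabricated;
# on a live host point DOM at /tmp/DOMOut.txt instead.
DOM=$(mktemp)
cat > "$DOM" <<'EOF'
{"content": {"componentState": 5}}
{"content": {"componentState": 6}}
{"content": {"componentState": 5}}
EOF
printf 'ACTIVE components: %s\n' "$(grep -c '"componentState": 5' "$DOM")"
printf 'ABSENT components: %s\n' "$(grep -c '"componentState": 6' "$DOM")"
rm -f "$DOM"
```

ABSENT components are the interesting ones here: they indicate data vSAN knows about but currently cannot reach.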

Note that VMware offers pay-per-incident support that you may be able to avail of here.

Anyone that knows enough about vSAN to help here probably works for VMware GSS/PSO and thus won't be able to accept payment as this would likely violate the terms of our contracts.

Bob

Edit: added a command.

ssSFrankSss
Enthusiast

Hi,

I didn't know that. OK, I will keep "pay-per-incident" as my last option then! I really appreciate your help, because I am in a difficult position!!!!!

Because I am really curious, with you guys' help, I found the keys, and I am here alone with the servers around me! I cannot wait until tomorrow :)

So what I was thinking is to swap power supplies and then see if I have faulty cards, in order to start the second server that doesn't power on - and then put in the working SSDs. I am not sure how to add a working server to the vSAN cluster without vCenter... So I will first try to power on the second server.

0 Kudos
TheBobkin
VMware Employee

Hello Frank,

Sorry, you said you didn't have vCenter already, so CLI it is.

What build of ESXi is in use? This will determine whether you may have to manually add entries to the unicastagent lists.

If it is 6.0/<6.5P01 and not a stretched cluster, then it is just a case of validating that you have the vSAN vmk configured in the same subnet as the other hosts' vSAN vmk, and joining the cluster:

On the node that never left cluster:

# esxcli vsan cluster get

Note the 'Sub-Cluster UUID'

On the node that you left cluster on:

# esxcli vsan cluster join -u <sub-Cluster UUID>
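The UUID can also be scraped straight out of that output. The sketch below parses a pasted excerpt (values taken from the esxi10 output later in this thread) instead of calling esxcli live, and it only echoes the join command rather than running it - the important point is that the 'Sub-Cluster UUID' must be read on the node that never left the cluster:

```shell
# Sketch: extract the Sub-Cluster UUID from `esxcli vsan cluster get` output.
# CLUSTER_GET is a pasted excerpt; on the node that never left the cluster,
# read the live command output instead:  esxcli vsan cluster get
CLUSTER_GET='   Local Node UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
   Sub-Cluster UUID: 529f57d0-a063-c30e-191f-8c9dab9faada'
UUID=$(printf '%s\n' "$CLUSTER_GET" | awk -F': ' '/Sub-Cluster UUID/ {print $2}')
# Echo (rather than execute) the command to run on the node that left:
echo "esxcli vsan cluster join -u $UUID"
```

Note that the awk pattern deliberately matches only 'Sub-Cluster UUID', not 'Sub-Cluster Master UUID' or the node's own 'Local Node UUID'.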

As I said above, I think it is worthwhile checking the state of the data components (and where the active/absent/degraded ones reside) with what you have now before starting switching hardware components (so that we can validate whether that will help).

Bob

ssSFrankSss
Enthusiast

Hi,

Yes, that worked! Now, with the server that returned, I have some info, as you can see in the attachment. But the commands above do not show anything on the new server; on the server that never left, I do get some output from them. What do you advise me to do now?

ssSFrankSss
Enthusiast

Please clear some things up for me; would it be best to:

A) Try to turn on the second server using parts from the one that has left the cluster, and put in the working SSDs?

B) Try to put the working SSDs in the newly joined server?

Also, suppose I have three SSDs - one 480GB, one 1.6TB and one 1.92TB - in each of the 3 servers (same setup).

If the 1.6TB disk failed in the second server and the 1.92TB in the third, can I mix the working disks in the third server? Or does it have to be all 3 straight from the same server?

I am asking because I do not remember if I have one failed disk or two...

TheBobkin
VMware Employee

Hello Frank,

That likely won't return any useful data when you still have just one node partitioned by itself.

Have you attempted rejoining the cluster on the node that you erased an SSD?

"If the 1.6TB disk failed in the second server and the 1.92TB in the third, can I mix the working disks in the third server? Or does it have to be all 3 straight from the same server?

I am asking because I do not remember if I have one failed disk or two..."

Do you have failed disks or disk(s) that you wiped?

Do you recall which disk(s) are failed/wiped? e.g. Capacity-tier or Cache-tier - if you wipe a Cache-tier device, the Disk-Group is gone, and no, you cannot add Capacity-tier devices with data on them to another existing Disk-Group.
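If unsure which device is which, the `Is Capacity Tier` flag in `esxcli vsan storage list` tells you. The sketch below parses a pasted excerpt (device names taken from the esxi10 output later in this thread); on a live host, pipe the command output into the same awk instead:

```shell
# Sketch: map devices to cache/capacity tier from `esxcli vsan storage list`
# output. STORAGE_LIST is a pasted excerpt; live, use:
#   esxcli vsan storage list | awk '...'
STORAGE_LIST='naa.5000c5003017925b
   Device: naa.5000c5003017925b
   Is Capacity Tier: false
naa.5000c5003018c6d3
   Device: naa.5000c5003018c6d3
   Is Capacity Tier: true'
printf '%s\n' "$STORAGE_LIST" | awk '
  /Device:/                 { dev = $2 }
  /Is Capacity Tier: false/ { print dev " -> cache tier" }
  /Is Capacity Tier: true/  { print dev " -> capacity tier" }'
```

In this excerpt, naa.5000c5003017925b comes out as the cache-tier device and naa.5000c5003018c6d3 as the capacity-tier device.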

Bob

ssSFrankSss
Enthusiast

Hi!

It looks like I am getting somewhere!!!!!!!!!!!!!

Let me explain, to clear things up - it is important. I have three servers; let's name them:

esxi10: The working server that has never left the vSAN cluster (vCenter also lives there); it has no faulty SSDs

esxi20: The server that does not power on; it has never left vSAN - I do not remember if it has faulty SSDs

esxi30: The server I joined back into the cluster; it has one wiped SSD and one SSD that does not work at all.

OK, I put esxi30 in maintenance mode and moved all the disks from esxi20 (the one that does not power on) into esxi30. It looks like all three SSDs are working fine (Phewww). And esxi30 can now see the vSAN datastore :D

esxi10 continues to see vSAN as 0. How should I continue?!

ssSFrankSss
Enthusiast

OK, I have tried to restart services on esxi10 with:

/etc/init.d/hostd restart

/etc/init.d/vpxa restart

Now it looks like esxi10 sees the vSAN too! But now I am back at my initial state (step "3") - where I could not access the VMs after the restart but could see the vSAN datastore.

Please check the attachments.

1) I cannot unregister and re-register VMs, as vSAN appears empty in the datastore browser.

2) I cannot power on any machine (the option is grayed out), as they appear invalid.

TheBobkin
VMware Employee

Hello Frank,

" And esxi30 can now see the vSAN datastore

esxi10 continues to see vSAN as 0. How should I continue?!"

Please share the output from both nodes of:

# df -h

# esxcli vsan cluster get

# vdq -Hi

# esxcli vsan storage list

# cmmds-tool find -t HOSTNAME -f json

# cmmds-tool find -t NODE_DECOM_STATE -f json

If a node is properly clustered with other nodes, then it should see the size of the vsanDatastore as the total of its own storage plus the other clustered nodes' storage (unless they are a) in Maintenance Mode or b) have their local storage unmounted/inaccessible).
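For a quick side-by-side comparison of what each node reports, the vsanDatastore line of `df -h` is enough. A tiny sketch (the sample line is pasted from the esxi10 output in this thread; on a live host pipe `df -h` into the same awk):

```shell
# Sketch: extract the vsanDatastore totals from `df -h` output so the numbers
# from each node can be compared directly. DF_LINE is a pasted sample; live:
#   df -h | awk '$1 == "vsan" {print "total=" $2, "used=" $3}'
DF_LINE='vsan         1.7T 964.3G    824.2G  54% /vmfs/volumes/vsanDatastore'
printf '%s\n' "$DF_LINE" | awk '$1 == "vsan" {print "total=" $2, "used=" $3}'
```

If the totals differ between nodes, they are almost certainly not seeing each other's storage, i.e. they are partitioned.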

Bob

ssSFrankSss
Enthusiast

Did you see my last post? Now both servers see the vSAN.

ssSFrankSss
Enthusiast

For esxi10:

[root@esxi10:~] df -h

Filesystem   Size   Used Available Use% Mounted on

VMFS-6      68.2G   5.3G     63.0G   8% /vmfs/volumes/DatastoreHP 1

VMFS-6       3.6T 151.3G      3.4T   4% /vmfs/volumes/NAS BackUP

vfat       285.8M 209.1M     76.8M  73% /vmfs/volumes/5aef06c6-480da194-cbd0-a0369f1fd368

vfat       249.7M 159.2M     90.5M  64% /vmfs/volumes/7e39d0ef-9c1fd3eb-730f-029c611a571f

vfat       249.7M 151.5M     98.3M  61% /vmfs/volumes/dc7018e8-ba3651d8-949c-d20a095b30e1

vsan         1.7T 964.3G    824.2G  54% /vmfs/volumes/vsanDatastore

[root@esxi10:~] esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2020-04-21T18:29:09Z

   Local Node UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368

   Local Node Type: NORMAL

   Local Node State: MASTER

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368

   Sub-Cluster Backup UUID:

   Sub-Cluster UUID: 529f57d0-a063-c30e-191f-8c9dab9faada

   Sub-Cluster Membership Entry Revision: 14

   Sub-Cluster Member Count: 1

   Sub-Cluster Member UUIDs: 5aef05d3-86c5-8538-a5c8-a0369f1fd368

   Sub-Cluster Membership UUID: 95ee975e-2188-ae76-a2f7-a0369f1fd368

   Unicast Mode Enabled: true

   Maintenance Mode State: OFF

   Config Generation: 80535089-0092-41b3-93dc-3df97c24b6b0 4 2020-03-27T10:40:06.555

[root@esxi10:~] vdq -Hi

Mappings:

   DiskMapping[0]:

           SSD:  naa.5000c5003017925b

            MD:  naa.5000c5003018c6d3

[root@esxi10:~] esxcli vsan storage list

naa.5000c5003017925b

   Device: naa.5000c5003017925b

   Display Name: naa.5000c5003017925b

   Is SSD: true

   VSAN UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8

   VSAN Disk Group UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8

   VSAN Disk Group Name: naa.5000c5003017925b

   Used by this host: true

   In CMMDS: true

   On-disk format version: 5

   Deduplication: false

   Compression: false

   Checksum: 7053597502770896794

   Checksum OK: true

   Is Capacity Tier: false

   Encryption: false

   DiskKeyLoaded: false

   Creation Time: Tue May  8 11:33:35 2018

naa.5000c5003018c6d3

   Device: naa.5000c5003018c6d3

   Display Name: naa.5000c5003018c6d3

   Is SSD: true

   VSAN UUID: 52a1d26c-9f75-4646-9e1b-203690ad4d57

   VSAN Disk Group UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8

   VSAN Disk Group Name: naa.5000c5003017925b

   Used by this host: true

   In CMMDS: true

   On-disk format version: 5

   Deduplication: false

   Compression: false

   Checksum: 13265426673954443271

   Checksum OK: true

   Is Capacity Tier: true

   Encryption: false

   DiskKeyLoaded: false

   Creation Time: Tue May  8 11:33:35 2018

[root@esxi10:~] cmmds-tool find -t HOSTNAME -f json

{

"entries":

[

{

   "uuid": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",

   "owner": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",

   "health": "Healthy",

   "revision": "0",

   "type": "HOSTNAME",

   "flag": "2",

   "minHostVersion": "0",

   "md5sum": "148ef7e719a8a60fe2691226efc28b1b",

   "valueLen": "32",

   "content": {"hostname": "esxi10.virtual.store"},

   "errorStr": "(null)"

}

,{

   "uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

   "owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

   "health": "Unhealthy",

   "revision": "0",

   "type": "HOSTNAME",

   "flag": "0",

   "minHostVersion": "0",

   "md5sum": "2215d3424456e0aaf039f1e639e0014d",

   "valueLen": "32",

   "content": {"hostname": "esxi30.virtual.store"},

   "errorStr": "(null)"

}

]

}

[root@esxi10:~] cmmds-tool find -t NODE_DECOM_STATE -f json

{

"entries":

[

{

   "uuid": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",

   "owner": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",

   "health": "Healthy",

   "revision": "10",

   "type": "NODE_DECOM_STATE",

   "flag": "2",

   "minHostVersion": "0",

   "md5sum": "3c2593056659ee3c9e97039a3eefea8e",

   "valueLen": "80",

   "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},

   "errorStr": "(null)"

}

,{

   "uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

   "owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

   "health": "Unhealthy",

   "revision": "0",

   "type": "NODE_DECOM_STATE",

   "flag": "0",

   "minHostVersion": "0",

   "md5sum": "ccd967c4aed2b1781c86c6e18e5d8348",

   "valueLen": "80",

   "content": {"decomState": 6, "decomJobType": 1, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 5},

   "errorStr": "(null)"

}

]

}

[root@esxi10:~]

ssSFrankSss
Enthusiast

For esxi30:

[root@esxi30:~] df -h

Filesystem   Size   Used Available Use% Mounted on

VMFS-6      68.2G   5.3G     63.0G   8% /vmfs/volumes/DatastoreHP 2

VMFS-6       3.6T 151.3G      3.4T   4% /vmfs/volumes/NAS BackUP

vfat       249.7M 159.3M     90.5M  64% /vmfs/volumes/38700cee-efb20e12-8f17-ea7cb3da94b7

vfat       285.8M 209.1M     76.8M  73% /vmfs/volumes/5aef35d7-7f17cad8-5f16-a0369f1fd36c

vfat       249.7M 151.5M     98.3M  61% /vmfs/volumes/b7cdaaec-14e48f24-4c35-859dabb8d5b9

vsan         1.7T 768.6G   1019.9G  43% /vmfs/volumes/vsanDatastore

[root@esxi30:~] esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2020-04-21T18:33:22Z

   Local Node UUID: 5aef3508-100f-974c-ba2e-a0369f1fd36c

   Local Node Type: NORMAL

   Local Node State: MASTER

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 5aef3508-100f-974c-ba2e-a0369f1fd36c

   Sub-Cluster Backup UUID:

   Sub-Cluster UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368

   Sub-Cluster Membership Entry Revision: 0

   Sub-Cluster Member Count: 1

   Sub-Cluster Member UUIDs: 5aef3508-100f-974c-ba2e-a0369f1fd36c

   Sub-Cluster Membership UUID: 6d1a9f5e-7c21-2810-3f16-a0369f1fd36c

   Unicast Mode Enabled: true

   Maintenance Mode State: OFF

   Config Generation: None 0 0.0

[root@esxi30:~] vdq -Hi

Mappings:

   DiskMapping[0]:

           SSD:  naa.5000c50030176437

            MD:  naa.5000c5003015fc67

[root@esxi30:~] esxcli vsan storage list

naa.5000c50030176437

   Device: naa.5000c50030176437

   Display Name: naa.5000c50030176437

   Is SSD: true

   VSAN UUID: 52a3828b-8bf2-20da-5da2-ec67db9d7389

   VSAN Disk Group UUID: 52a3828b-8bf2-20da-5da2-ec67db9d7389

   VSAN Disk Group Name: naa.5000c50030176437

   Used by this host: true

   In CMMDS: true

   On-disk format version: 5

   Deduplication: false

   Compression: false

   Checksum: 584334970149048049

   Checksum OK: true

   Is Capacity Tier: false

   Encryption: false

   DiskKeyLoaded: false

   Creation Time: Tue May  8 14:33:49 2018

naa.5000c5003015fc67

   Device: naa.5000c5003015fc67

   Display Name: naa.5000c5003015fc67

   Is SSD: true

   VSAN UUID: 52c7e41d-69a6-bf50-ff6f-0988952f2379

   VSAN Disk Group UUID: 52a3828b-8bf2-20da-5da2-ec67db9d7389

   VSAN Disk Group Name: naa.5000c50030176437

   Used by this host: true

   In CMMDS: true

   On-disk format version: 5

   Deduplication: false

   Compression: false

   Checksum: 3089451523050927556

   Checksum OK: true

   Is Capacity Tier: true

   Encryption: false

   DiskKeyLoaded: false

   Creation Time: Tue May  8 14:33:49 2018

[root@esxi30:~] cmmds-tool find -t HOSTNAME -f json

{

"entries":

[

{

   "uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

   "owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

   "health": "Healthy",

   "revision": "0",

   "type": "HOSTNAME",

   "flag": "2",

   "minHostVersion": "0",

   "md5sum": "2215d3424456e0aaf039f1e639e0014d",

   "valueLen": "32",

   "content": {"hostname": "esxi30.virtual.store"},

   "errorStr": "(null)"

}

]

}

[root@esxi30:~] cmmds-tool find -t NODE_DECOM_STATE -f json

{

"entries":

[

{

   "uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

   "owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",

   "health": "Healthy",

   "revision": "7",

   "type": "NODE_DECOM_STATE",

   "flag": "2",

   "minHostVersion": "0",

   "md5sum": "3c2593056659ee3c9e97039a3eefea8e",

   "valueLen": "80",

   "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},

   "errorStr": "(null)"

}

]

}

[root@esxi30:~]

TheBobkin
VMware Employee

Hello Frank,

"Sub-Cluster Member Count: 1"

They are not members of the same cluster, because you used node esxi10's UUID instead of the cluster UUID:

"[root@esxi30:~] esxcli vsan cluster get

...

   Sub-Cluster UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368"

So basically you created a new cluster with that UUID.

Leave and rejoin the cluster on esxi30, this time using the correct UUID:

# esxcli vsan cluster leave

# esxcli vsan cluster join -u 529f57d0-a063-c30e-191f-8c9dab9faada

You may need to manually repopulate the unicastagent lists:

VMware Knowledge Base
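For a two-node repopulation, the entries can be generated as a dry run first. Sketch only: the UUIDs below are the node UUIDs from the outputs in this thread, but the 192.168.x.x IP addresses are placeholders that must be replaced with the real vSAN vmk addresses, and the `unicastagent add` flags follow the KB's syntax - double-check against your build before running anything:

```shell
# Dry-run sketch: print (do NOT execute) the unicastagent entry each node
# needs for the OTHER node. UUIDs are from this thread; the 192.168.x.x IPs
# are placeholders for the real vSAN vmk addresses. Verify the flags against
# the KB for your build before running the echoed commands on each host.
UUID_ESXI10="5aef05d3-86c5-8538-a5c8-a0369f1fd368"   # esxi10 Local Node UUID
UUID_ESXI30="5aef3508-100f-974c-ba2e-a0369f1fd36c"   # esxi30 Local Node UUID
IP_ESXI10="192.168.1.10"   # placeholder
IP_ESXI30="192.168.1.30"   # placeholder
echo "on esxi10: esxcli vsan cluster unicastagent add -t node -u $UUID_ESXI30 -U true -a $IP_ESXI30 -p 12321"
echo "on esxi30: esxcli vsan cluster unicastagent add -t node -u $UUID_ESXI10 -U true -a $IP_ESXI10 -p 12321"
```

Each host's unicastagent list should contain every other cluster member, never the host itself.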

Bob
