Is there somebody who can help me recover my vSAN setup?
That day was so bad; I was making bad decisions one after the other. I literally cannot believe myself.
Let me describe my setup briefly: 3 Dell R720xd nodes with 3 SSDs each, and vCenter running on vSAN.
1) I lost my UPS, an APC. I sent it to be repaired and meanwhile used an old one with rather old batteries. They never returned the APC to me, so I left the old one in place.
- I never changed it. I said, OK, it holds for a few minutes, we are fine.
2) I lost one SSD; the system kept working fine.
- I said, OK, it works, I am too busy, let's do something else.
Then things went really bad! That day arrived: due to strong winds we had power outages during the night. When I got to the site in the morning I could not locate any VMs. I thought, OK, maybe I just have to power them on, so I logged in to ESXi. vCenter was not working, and when I tried to restart the server from ESXi the task kept loading forever. I could not open any VM or do anything.
so...
3) I restarted all three servers simultaneously. (It is bad, I know, don't ask me why I did that.) After that restart, every VM appeared with Status: "Invalid". I could still see the vSAN datastore and its capacity, but I could not re-register any VM, because the datastore browser showed no files.
4) I said, OK, let me shut them down one by one, open them up and get the dust out (they had collected a lot over the years), and then see if something gets fixed! I opened and cleaned them carefully, but now one of the three servers would not power on at all! It went from bad to worse. At this point the vSAN datastore appeared as 0 from the other two servers!
5) Sad, mad, and without a clear mind, I said, OK, let's restore from an older backup and rebuild everything from scratch. So, of the 2 running servers, I removed one from the vSAN cluster and tried to erase one of its disks to start over from the top. At that point I remembered that I have some really important files that were never backed up.
Long story short: I have one server that cannot be powered on (I tried swapping power supplies, no luck), and one of the two working servers has one erased SSD (out of 3) and has also been removed from the vSAN cluster.
Since I have some really important files I want to retrieve, I would appreciate it if somebody could help me with this situation.
I think even God would find this difficult, but maybe a real vSAN and server expert could help. I am really sad about my bad decisions. I do not have any support at this time, so I would be glad to discuss a payment in PM if somebody could help me!
Thank you in advance,
Frank
Did you try contacting VMware Support? If that is possible at all, which sounds unlikely, they would be your best option!
Okay, I have a crazy idea, which may work...
You have 2 working servers, or at least 2 servers that can be powered on. Considering the host with the failed SSD was the first failure in your environment, the below may just work. I provide no guarantee and can't be held liable in ANY SHAPE OR FORM!
I would try the following:
No, as I do not have any support subscription... That is why I am asking whether a vSAN expert could help me, with payment of course!
See the above, it may just work. That is what I would do to be honest.
This is a crazy idea indeed, haha. I will try tomorrow, although it is highly unlikely to work. In any case, if it works, I will ask for your PayPal account to give you a donation :smileysilly:
Nah, I work for VMware, I can't take any money. But theoretically it has a chance of success, as the host that died last should have the latest data, just like the host which is 100% healthy. So if you combine the good disks with the good server, you should be able to power on VMs again.
Hello Frank,
I was going to suggest, as Duncan said, removing the disks from the host you cannot power on and putting them in a server that you can power on but that doesn't have functional/current data.
However, before you do this - can you add the host you removed from the cluster back (either via the UI or via the CLI if on an older build) and share the output of:
# esxcli vsan debug object list
If this is an older build which doesn't have this command then similar data can be generated using:
# python /usr/lib/vmware/vsan/bin/vsan-health-status.pyc > /tmp/healthOut.txt
If it cannot tell the state of the data due to everything being inaccessible then the CMMDS output should tell us what the state of the components are:
# cmmds-tool find -t DOM_OBJECT -f json > /tmp/DOMOut.txt
# cmmds-tool find -t HOSTNAME -f json > /tmp/HOSTNAMEOut.txt
# cmmds-tool find -t NODE_DECOM_STATE -f json > /tmp/DECOMOut.txt
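Once you have those dumps, a quick way to eyeball per-host state without reading the whole JSON is to pull out just the hostname and health fields with standard tools. This is only a sketch; the heredoc below is a stand-in mimicking the shape that `cmmds-tool -f json` emits, not real output:

```shell
# Sketch only: extract each entry's hostname and health from the
# HOSTNAME dump. The sample file stands in for real cmmds-tool output.
cat > /tmp/HOSTNAMEOut.txt <<'EOF'
{"entries": [
 {"uuid": "5aef05d3", "health": "Healthy", "content": {"hostname": "esxi10.virtual.store"}},
 {"uuid": "5aef3508", "health": "Unhealthy", "content": {"hostname": "esxi30.virtual.store"}}
]}
EOF
# Print the health/hostname pairs, one field per line:
grep -Eo '"(hostname|health)": "[^"]*"' /tmp/HOSTNAMEOut.txt
```

An "Unhealthy" HOSTNAME entry paired with a node's hostname is a quick hint that the cluster still considers that node missing or partitioned.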
Note that VMware offers pay-per-incident support that you may be able to avail of here.
Anyone who knows enough about vSAN to help here probably works for VMware GSS/PSO and thus won't be able to accept payment, as this would likely violate the terms of our contracts.
Bob
Edit: added a command.
Hi,
I didn't know this. OK, I will keep "pay-per-incident" as my last option then! I really appreciate your help because I am in a difficult position!!!!
Because I am really curious, and with the help of you guys, I found the keys, and I am here alone with the servers around me! I cannot wait until tomorrow.
So what I was thinking is to swap power supplies and then check for faulty cards, in order to start the second server that doesn't power on - and then put the working SSDs in it. I am not sure how to add a working server to the vSAN cluster without vCenter... so I will try to power on the second server first.
Hello Frank,
Sorry, you said you don't have vCenter available, so the CLI it is.
What build of ESXi is in use? This will determine whether you may have to manually add entries to the unicastagent lists.
If it is 6.0/<6.5P01 and not a stretched cluster, then it is just a case of validating that the vSAN vmk is configured in the same subnet as the other hosts' vSAN vmk, and then joining the cluster:
On the node that never left cluster:
# esxcli vsan cluster get
Note the 'Sub-Cluster UUID'
On the node that you left cluster on:
# esxcli vsan cluster join -u <sub-Cluster UUID>
As I said above, I think it is worthwhile checking the state of the data components (and where the active/absent/degraded ones reside) with what you have now, before you start swapping hardware components, so that we can validate whether that will help.
Bob
Please clear some things up for me. Would it be best to:
A) Try to power on the second server using parts from the one that left the cluster, and then put the working SSDs in it?
B) Try to put the working SSDs in the newly rejoined server?
Also, suppose I have three SSDs - one 480GB, one 1.6TB, and one 1.92TB - on each of the 3 servers, the same setup.
If the second server has a failed 1.6TB disk and the third server a failed 1.92TB disk, can I mix the working disks in the third server? Or does it have to be all 3 from the same server?
I am asking because I do not remember if I have one failed disk or two...
Hello Frank,
That likely won't return any useful data while you still have just one node partitioned by itself.
Have you attempted rejoining the cluster on the node where you erased an SSD?
"If the second server has a failed 1.6TB disk and the third server a failed 1.92TB disk, can I mix the working disks in the third server? Or does it have to be all 3 from the same server?
I am asking because I do not remember if I have one failed disk or two..."
Do you have failed disks, or disk(s) that you wiped?
Do you recall which disk(s) are failed/wiped? e.g. Capacity-tier or Cache-tier - if you wipe a Cache-tier device the whole Disk-Group is gone, and no, you cannot add Capacity-tier devices with data on them to another existing Disk-Group.
Bob
Hi!
It looks like I am getting somewhere!!!!!!!!!!!!!
Let me explain to clear things up; it is important. Suppose I have three servers; let's name them:
esxi10: the working server that has never left the vSAN cluster (vCenter lives there too), and has no faulty SSDs.
esxi20: the server that does not power on; it never left the vSAN cluster - I do not remember if it has faulty SSDs.
esxi30: the server I joined back to the cluster; it has one wiped SSD and one SSD that does not work at all.
OK, I put esxi30 in maintenance mode and moved all the disks from esxi20 (the one that does not power on) into esxi30. It looks like all three SSDs are working fine (phewww), and esxi30 can now see the vSAN datastore :smileygrin:
esxi10 still sees the vSAN datastore as 0. How should I continue?!
OK, I tried restarting the services on esxi10 with:
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
Now it looks like esxi10 sees the vSAN datastore too! But I am back to my initial state (step "3") - where I could not access the VMs after the restart but could see the vSAN datastore.
Please check attachments.
1) I cannot unregister and re-register VMs, as the vSAN datastore appears empty in the datastore browser.
2) I cannot power on any machine (the option is grayed out), as they all appear invalid.
Hello Frank,
"And esxi30 can now see the vSAN datastore
esxi10 still sees the vSAN datastore as 0. How should I continue?!"
Please share the output from both nodes of:
# df -h
# esxcli vsan cluster get
# vdq -Hi
# esxcli vsan storage list
# cmmds-tool find -t HOSTNAME -f json
# cmmds-tool find -t NODE_DECOM_STATE -f json
If a node is properly clustered with other nodes, then it should see the size of the vsanDatastore as the total of its own storage plus the other clustered nodes' storage (unless those nodes are a) in Maintenance Mode or b) have their local storage unmounted/inaccessible).
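As a side note, a quick sanity check for partitioning is to compare the Sub-Cluster UUID field from each node's `esxcli vsan cluster get` output: only nodes reporting the same value are actually clustered together. This is just a sketch with illustrative values pasted into temp files; on the real hosts you would redirect the command's output instead:

```shell
# Sketch: two nodes are only in the same cluster if they report the SAME
# Sub-Cluster UUID. The heredoc contents are illustrative stand-ins for
# each node's real `esxcli vsan cluster get` output.
cat > /tmp/cluster-esxi10.txt <<'EOF'
Sub-Cluster UUID: 529f57d0-a063-c30e-191f-8c9dab9faada
EOF
cat > /tmp/cluster-esxi30.txt <<'EOF'
Sub-Cluster UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
EOF
u10=$(awk -F': *' '/Sub-Cluster UUID/ {print $2}' /tmp/cluster-esxi10.txt)
u30=$(awk -F': *' '/Sub-Cluster UUID/ {print $2}' /tmp/cluster-esxi30.txt)
if [ "$u10" = "$u30" ]; then
    echo "same cluster"
else
    echo "partitioned"   # different UUIDs: the nodes formed separate clusters
fi
```

With the values above this prints "partitioned", which is exactly the situation a "Sub-Cluster Member Count: 1" on both nodes would suggest.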
Bob
Did you see my last post? Now both servers see the vSAN datastore.
For esxi10:
[root@esxi10:~] df -h
Filesystem Size Used Available Use% Mounted on
VMFS-6 68.2G 5.3G 63.0G 8% /vmfs/volumes/DatastoreHP 1
VMFS-6 3.6T 151.3G 3.4T 4% /vmfs/volumes/NAS BackUP
vfat 285.8M 209.1M 76.8M 73% /vmfs/volumes/5aef06c6-480da194-cbd0-a0369f1fd368
vfat 249.7M 159.2M 90.5M 64% /vmfs/volumes/7e39d0ef-9c1fd3eb-730f-029c611a571f
vfat 249.7M 151.5M 98.3M 61% /vmfs/volumes/dc7018e8-ba3651d8-949c-d20a095b30e1
vsan 1.7T 964.3G 824.2G 54% /vmfs/volumes/vsanDatastore
[root@esxi10:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-04-21T18:29:09Z
Local Node UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 529f57d0-a063-c30e-191f-8c9dab9faada
Sub-Cluster Membership Entry Revision: 14
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
Sub-Cluster Membership UUID: 95ee975e-2188-ae76-a2f7-a0369f1fd368
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 80535089-0092-41b3-93dc-3df97c24b6b0 4 2020-03-27T10:40:06.555
[root@esxi10:~] vdq -Hi
Mappings:
DiskMapping[0]:
SSD: naa.5000c5003017925b
MD: naa.5000c5003018c6d3
[root@esxi10:~] esxcli vsan storage list
naa.5000c5003017925b
Device: naa.5000c5003017925b
Display Name: naa.5000c5003017925b
Is SSD: true
VSAN UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8
VSAN Disk Group UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8
VSAN Disk Group Name: naa.5000c5003017925b
Used by this host: true
In CMMDS: true
On-disk format version: 5
Deduplication: false
Compression: false
Checksum: 7053597502770896794
Checksum OK: true
Is Capacity Tier: false
Encryption: false
DiskKeyLoaded: false
Creation Time: Tue May 8 11:33:35 2018
naa.5000c5003018c6d3
Device: naa.5000c5003018c6d3
Display Name: naa.5000c5003018c6d3
Is SSD: true
VSAN UUID: 52a1d26c-9f75-4646-9e1b-203690ad4d57
VSAN Disk Group UUID: 528579cb-e1a8-0b53-95b2-bcb90bfe3cf8
VSAN Disk Group Name: naa.5000c5003017925b
Used by this host: true
In CMMDS: true
On-disk format version: 5
Deduplication: false
Compression: false
Checksum: 13265426673954443271
Checksum OK: true
Is Capacity Tier: true
Encryption: false
DiskKeyLoaded: false
Creation Time: Tue May 8 11:33:35 2018
[root@esxi10:~] cmmds-tool find -t HOSTNAME -f json
{
"entries":
[
{
"uuid": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",
"owner": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",
"health": "Healthy",
"revision": "0",
"type": "HOSTNAME",
"flag": "2",
"minHostVersion": "0",
"md5sum": "148ef7e719a8a60fe2691226efc28b1b",
"valueLen": "32",
"content": {"hostname": "esxi10.virtual.store"},
"errorStr": "(null)"
}
,{
"uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",
"owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",
"health": "Unhealthy",
"revision": "0",
"type": "HOSTNAME",
"flag": "0",
"minHostVersion": "0",
"md5sum": "2215d3424456e0aaf039f1e639e0014d",
"valueLen": "32",
"content": {"hostname": "esxi30.virtual.store"},
"errorStr": "(null)"
}
]
}
[root@esxi10:~] cmmds-tool find -t NODE_DECOM_STATE -f json
{
"entries":
[
{
"uuid": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",
"owner": "5aef05d3-86c5-8538-a5c8-a0369f1fd368",
"health": "Healthy",
"revision": "10",
"type": "NODE_DECOM_STATE",
"flag": "2",
"minHostVersion": "0",
"md5sum": "3c2593056659ee3c9e97039a3eefea8e",
"valueLen": "80",
"content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
"errorStr": "(null)"
}
,{
"uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",
"owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",
"health": "Unhealthy",
"revision": "0",
"type": "NODE_DECOM_STATE",
"flag": "0",
"minHostVersion": "0",
"md5sum": "ccd967c4aed2b1781c86c6e18e5d8348",
"valueLen": "80",
"content": {"decomState": 6, "decomJobType": 1, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 5},
"errorStr": "(null)"
}
]
}
[root@esxi10:~]
For esxi30:
[root@esxi30:~] df -h
Filesystem Size Used Available Use% Mounted on
VMFS-6 68.2G 5.3G 63.0G 8% /vmfs/volumes/DatastoreHP 2
VMFS-6 3.6T 151.3G 3.4T 4% /vmfs/volumes/NAS BackUP
vfat 249.7M 159.3M 90.5M 64% /vmfs/volumes/38700cee-efb20e12-8f17-ea7cb3da94b7
vfat 285.8M 209.1M 76.8M 73% /vmfs/volumes/5aef35d7-7f17cad8-5f16-a0369f1fd36c
vfat 249.7M 151.5M 98.3M 61% /vmfs/volumes/b7cdaaec-14e48f24-4c35-859dabb8d5b9
vsan 1.7T 768.6G 1019.9G 43% /vmfs/volumes/vsanDatastore
[root@esxi30:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-04-21T18:33:22Z
Local Node UUID: 5aef3508-100f-974c-ba2e-a0369f1fd36c
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 5aef3508-100f-974c-ba2e-a0369f1fd36c
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 5aef3508-100f-974c-ba2e-a0369f1fd36c
Sub-Cluster Membership UUID: 6d1a9f5e-7c21-2810-3f16-a0369f1fd36c
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: None 0 0.0
[root@esxi30:~] vdq -Hi
Mappings:
DiskMapping[0]:
SSD: naa.5000c50030176437
MD: naa.5000c5003015fc67
[root@esxi30:~] esxcli vsan storage list
naa.5000c50030176437
Device: naa.5000c50030176437
Display Name: naa.5000c50030176437
Is SSD: true
VSAN UUID: 52a3828b-8bf2-20da-5da2-ec67db9d7389
VSAN Disk Group UUID: 52a3828b-8bf2-20da-5da2-ec67db9d7389
VSAN Disk Group Name: naa.5000c50030176437
Used by this host: true
In CMMDS: true
On-disk format version: 5
Deduplication: false
Compression: false
Checksum: 584334970149048049
Checksum OK: true
Is Capacity Tier: false
Encryption: false
DiskKeyLoaded: false
Creation Time: Tue May 8 14:33:49 2018
naa.5000c5003015fc67
Device: naa.5000c5003015fc67
Display Name: naa.5000c5003015fc67
Is SSD: true
VSAN UUID: 52c7e41d-69a6-bf50-ff6f-0988952f2379
VSAN Disk Group UUID: 52a3828b-8bf2-20da-5da2-ec67db9d7389
VSAN Disk Group Name: naa.5000c50030176437
Used by this host: true
In CMMDS: true
On-disk format version: 5
Deduplication: false
Compression: false
Checksum: 3089451523050927556
Checksum OK: true
Is Capacity Tier: true
Encryption: false
DiskKeyLoaded: false
Creation Time: Tue May 8 14:33:49 2018
[root@esxi30:~] cmmds-tool find -t HOSTNAME -f json
{
"entries":
[
{
"uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",
"owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",
"health": "Healthy",
"revision": "0",
"type": "HOSTNAME",
"flag": "2",
"minHostVersion": "0",
"md5sum": "2215d3424456e0aaf039f1e639e0014d",
"valueLen": "32",
"content": {"hostname": "esxi30.virtual.store"},
"errorStr": "(null)"
}
]
}
[root@esxi30:~] cmmds-tool find -t NODE_DECOM_STATE -f json
{
"entries":
[
{
"uuid": "5aef3508-100f-974c-ba2e-a0369f1fd36c",
"owner": "5aef3508-100f-974c-ba2e-a0369f1fd36c",
"health": "Healthy",
"revision": "7",
"type": "NODE_DECOM_STATE",
"flag": "2",
"minHostVersion": "0",
"md5sum": "3c2593056659ee3c9e97039a3eefea8e",
"valueLen": "80",
"content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
"errorStr": "(null)"
}
]
}
[root@esxi30:~]
Hello Frank,
"Sub-Cluster Member Count: 1"
They are not members of the same cluster, because esxi30 joined using node esxi10's Local Node UUID instead of the cluster UUID:
"[root@esxi30:~] esxcli vsan cluster get
...
Sub-Cluster UUID: 5aef05d3-86c5-8538-a5c8-a0369f1fd368"
So basically you created a new cluster with that UUID.
Leave and rejoin the cluster correctly by running, on esxi30:
# esxcli vsan cluster leave
# esxcli vsan cluster join -u 529f57d0-a063-c30e-191f-8c9dab9faada
You may need to manually repopulate the unicastagent lists.
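For reference, and strictly from memory (verify the exact flags with `esxcli vsan cluster unicastagent add --help` on your build before running anything), repopulating a node's unicastagent list looks roughly like the below, run once per other cluster member. The `<...>` placeholders are the other node's vSAN node UUID and its vSAN vmk IP address:
# esxcli vsan cluster unicastagent list
# esxcli vsan cluster unicastagent add -t node -u <other node UUID> -U true -a <other node vSAN IP> -p 12321
When complete, each node's unicastagent list should contain an entry for every other member of the cluster.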
Bob