Hello all,
I will try to describe the case in as few words as possible, but please read through all the details
before contacting me to undertake this assignment.
We are willing to pay either per hour or a fixed fee for an expert to assist in rescuing our data.
(contact me via Private Message or post reply below with preferable method of contact)
Recently, we set up VSAN with 6 Dell servers, each server contributing one SSD and one SATA drive.
All seemed to go well, so after a month or so of reliable operation, we decided to deploy some critical data / production virtual machines on VSAN.
Recently, due to a networking outage (which was later fixed), VSAN lost the drives and does not seem to be able to resync 2 of them!
The huge problem is that we don't have a backup of this critical data.
(Yes, we know it is bad news and a bad setup to have no backup. We used to have a Synology which
stored our corporate data, but as we wanted to attach it to vCenter, we decided to move the data into
VSAN (on a Windows VM) for a few days, until we could reformat the Synology and attach it to vCenter as a datastore.
Murphy's law seems to have found a perfect match here: something could go wrong, and it did!)
The problem seems to be VSAN's inability to initialize the Disk Group, which fails with the message: Out of memory
(Not sure whether this is a VSAN bug or something solvable without official support.)
Please find below all related logs and screenshots, which should help you analyze whether this is something you can do or not.
HARDWARE INFORMATION:
PowerEdge 1950
BIOS Version 2.7.0
Service Tag ****
Host Name esxi01.****
Operating System Name VMware ESXi 5.5.0 build-1474528
4 NIC (2 on board and 2 on PCIe)
RAC Information
Name DRAC 5
Product Information Dell Remote Access Controller 5
Hardware Version A00
Firmware Version 1.65 (12.08.16)
Firmware Updated Thu Feb 27 09:47:26 2014
RAC Time Wed Oct 15 17:18:12 2014
DELL Perc 6/I Integrated Controller
HDD LIST:
naa.6001e4f01f123d00ff000054053ef0b3
Device: naa.6001e4f01f123d00ff000054053ef0b3
Display Name: naa.6001e4f01f123d00ff000054053ef0b3
Is SSD: false
VSAN UUID: 527eb553-84f7-dcfd-8d86-d7fac441ae69
VSAN Disk Group UUID: 5284a2b8-286f-06a1-446c-0859b15c8c48
VSAN Disk Group Name: naa.6001e4f01f123d001aa3557a0c1ba75d
Used by this host: true
In CMMDS: true
Checksum: 3053228329873442935
Checksum OK: true
naa.6001e4f01f123d001aa3557a0c1ba75d
Device: naa.6001e4f01f123d001aa3557a0c1ba75d
Display Name: naa.6001e4f01f123d001aa3557a0c1ba75d
Is SSD: true
VSAN UUID: 5284a2b8-286f-06a1-446c-0859b15c8c48
VSAN Disk Group UUID: 5284a2b8-286f-06a1-446c-0859b15c8c48
VSAN Disk Group Name: naa.6001e4f01f123d001aa3557a0c1ba75d
Used by this host: true
In CMMDS: true
Checksum: 10629673102587722323
Checksum OK: true
naa.6001e4f01f123d001bbd4d1006c2faa8
Device: naa.6001e4f01f123d001bbd4d1006c2faa8
Display Name: naa.6001e4f01f123d001bbd4d1006c2faa8
Is SSD: false
VSAN UUID: 529f41b2-6c21-d365-a3ea-7b7fd2af4a0c
VSAN Disk Group UUID: 5284a2b8-286f-06a1-446c-0859b15c8c48
VSAN Disk Group Name: naa.6001e4f01f123d001aa3557a0c1ba75d
Used by this host: true
In CMMDS: true
Checksum: 1032555466466759381
Checksum OK: true
...
IP Addresses:
Esxi01: 192.*.*.50 --- Good Condition
Esxi02: 192.*.*.51 --- Good Condition
Esxi03: 192.*.*.52 --- Good Condition
Esxi04: 192.*.*.53 --- Good Condition
Esxi05: 192.*.*.54 --- Bad Condition
Esxi06: 192.*.*.55 --- Bad Condition
vsan.disks_stats cluster
192.168.240.54
# esxcli vsan storage list
naa.6001e4f01f124b001aa364c60cfb6035
Device: naa.6001e4f01f124b001aa364c60cfb6035
Display Name: naa.6001e4f01f124b001aa364c60cfb6035
Is SSD: true
VSAN UUID: 52448891-b3bb-3f6a-1d63-59a069d745ce
VSAN Disk Group UUID: 52448891-b3bb-3f6a-1d63-59a069d745ce
VSAN Disk Group Name: naa.6001e4f01f124b001aa364c60cfb6035
Used by this host: true
In CMMDS: true
Checksum: 14125957398847925782
Checksum OK: true
naa.6001e4f01f124b001aa364e00e7a60bf
Device: naa.6001e4f01f124b001aa364e00e7a60bf
Display Name: naa.6001e4f01f124b001aa364e00e7a60bf
Is SSD: false
VSAN UUID: 527334fb-2f62-41ef-f067-f99caee21be8
VSAN Disk Group UUID: 52448891-b3bb-3f6a-1d63-59a069d745ce
VSAN Disk Group Name: naa.6001e4f01f124b001aa364c60cfb6035
Used by this host: true
In CMMDS: false
Checksum: 17995978848703998982
Checksum OK: true
naa.6001e4f01f124b001bbd549905ee5d3f
Device: naa.6001e4f01f124b001bbd549905ee5d3f
Display Name: naa.6001e4f01f124b001bbd549905ee5d3f
Is SSD: false
VSAN UUID: 52852189-90e2-d9dd-e7a6-3b63a8510db6
VSAN Disk Group UUID: 52448891-b3bb-3f6a-1d63-59a069d745ce
VSAN Disk Group Name: naa.6001e4f01f124b001aa364c60cfb6035
Used by this host: true
In CMMDS: false
Checksum: 3794219025321980740
Checksum OK: true
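For anyone skimming the listings above: the interesting disks are the ones reporting "In CMMDS: false", i.e. disks the cluster directory no longer knows about. A minimal sketch of how one could pull those out of the pasted `esxcli vsan storage list` text (hypothetical helper, assuming the key/value layout shown above):

```python
def disks_not_in_cmmds(esxcli_output: str):
    """Return (device, is_ssd) pairs for disks reporting 'In CMMDS: false'.

    Assumes the key/value stanza layout of `esxcli vsan storage list`
    as pasted above: each stanza has 'Device:', 'Is SSD:', 'In CMMDS:'.
    """
    disks, device, is_ssd = [], None, None
    for line in esxcli_output.splitlines():
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if key == "Device":
            device = value
        elif key == "Is SSD":
            is_ssd = (value == "true")
        elif key == "In CMMDS" and value == "false":
            disks.append((device, is_ssd))
    return disks
```

Run against the 192.168.240.54 listing below, this would flag the two magnetic disks (naa.6001e4f01f124b001aa364e00e7a60bf and naa.6001e4f01f124b001bbd549905ee5d3f), matching the failed disk-group recovery in the logs.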
LOGS 192.168.240.54
2014-10-15T07:10:45.078Z cpu3:33550)WARNING: Created slab RcSsdParentsSlab_0 (prealloc 0), 50000 entities of size 224, total 10 MB, numheaps 1
2014-10-15T07:10:45.079Z cpu3:33550)WARNING: Created slab RcSsdIoSlab_1 (prealloc 0), 50000 entities of size 65552, total 3125 MB, numheaps 2
2014-10-15T07:10:45.079Z cpu3:33550)WARNING: Created slab RcSsdMdBElemSlab_2 (prealloc 0), 4096 entities of size 52, total 0 MB, numheaps 1
2014-10-15T07:10:45.079Z cpu3:33550)WARNING: Created slab RCInvBmapSlab_3 (prealloc 0), 200000 entities of size 64, total 12 MB, numheaps 1
2014-10-15T07:10:45.079Z cpu1:33181)WARNING: LSOM: LSOMAddDiskGroupDispatch:4923: Created disk for 527334fb-2f62-41ef-f067-f99caee21be8
2014-10-15T07:10:45.079Z cpu1:33181)WARNING: LSOM: LSOMAddDiskGroupDispatch:4923: Created disk for 52852189-90e2-d9dd-e7a6-3b63a8510db6
2014-10-15T07:10:45.079Z cpu1:33181)LSOMCommon: LSOM_DiskGroupCreate:958: Creating disk group heap UUID: 52448891-b3bb-3f6a-1d63-59a069d745ce mdCnt 6 -- ssdQueueLen 20000 -- mdQueueLen 100 --ssdCap 76099203072 -- mdCap 0
2014-10-15T07:10:45.096Z cpu2:33444)WARNING: Created heap LSOMDiskGroup_001 (prealloc 1), maxsize 128 MB
2014-10-15T07:10:45.100Z cpu2:33444)WARNING: Created slab PLOG_TaskSlab_DG_001 (prealloc 1), 20000 entities of size 944, total 18 MB, numheaps 1
2014-10-15T07:10:45.105Z cpu2:33444)WARNING: Created slab LSOM_TaskSlab_DG_001 (prealloc 1), 20000 entities of size 824, total 15 MB, numheaps 1
2014-10-15T07:10:45.107Z cpu2:33444)WARNING: Created slab PLOG_RDTBuffer_DG_001 (prealloc 1), 20000 entities of size 184, total 3 MB, numheaps 1
2014-10-15T07:10:45.107Z cpu2:33444)WARNING: Created slab PLOG_RDTSGArrayRef_DG_001 (prealloc 1), 20000 entities of size 48, total 0 MB, numheaps 1
2014-10-15T07:10:45.125Z cpu2:33444)WARNING: Created slab LSOM_LsnEntrySlab_DG_001 (prealloc 1), 160000 entities of size 200, total 30 MB, numheaps 1
2014-10-15T07:10:45.126Z cpu2:33444)WARNING: Created slab SSDLOG_AllocMapSlab_DG_001 (prealloc 1), 8192 entities of size 34, total 0 MB, numheaps 1
2014-10-15T07:10:45.131Z cpu2:33444)WARNING: Created slab SSDLOG_LogBlkDescSlab_DG_001 (prealloc 1), 8192 entities of size 4570, total 35 MB, numheaps 1
2014-10-15T07:10:45.132Z cpu2:33444)WARNING: Created slab SSDLOG_CBContextSlab_DG_001 (prealloc 1), 8192 entities of size 90, total 0 MB, numheaps 1
2014-10-15T07:10:45.135Z cpu2:33444)WARNING: Created slab BL_NodeSlab_DG_001 (prealloc 1), 28400 entities of size 312, total 8 MB, numheaps 1
2014-10-15T07:10:45.175Z cpu2:33444)WARNING: Created slab BL_CBSlab_DG_001 (prealloc 1), 28400 entities of size 10248, total 277 MB, numheaps 1
2014-10-15T07:10:45.206Z cpu2:33444)WARNING: Created slab BL_NodeKeysSlab_DG_001 (prealloc 1), 5990 entities of size 40971, total 234 MB, numheaps 1
2014-10-15T07:10:45.231Z cpu1:33181)WARNING: LSOMCommon: SSDLOGEnumLogCB:828: Estimated time for recovering 1437639 log blks is 351536 ms device: naa.6001e4f01f124b001aa364c60cfb6035:2
2014-10-15T07:15:00.834Z cpu1:32786)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x1a (0x412e80899f80, 0) to dev "mpx.vmhba34:C0:T0:L0" on path "vmhba34:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE
2014-10-15T07:15:00.834Z cpu1:32786)ScsiDeviceIO: 2337: Cmd(0x412e80899f80) 0x1a, CmdSN 0x1bb from world 0 to dev "mpx.vmhba34:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2014-10-15T07:15:00.856Z cpu1:32780)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x1a (0x412e80899f80, 0) to dev "t10.DP______BACKPLANE000000" on path "vmhba1:C0:T32:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2014-10-15T07:15:00.856Z cpu1:32780)ScsiDeviceIO: 2337: Cmd(0x412e80899f80) 0x1a, CmdSN 0x1bc from world 0 to dev "t10.DP______BACKPLANE000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-10-15T07:16:00.929Z cpu1:33546)WARNING: Heap: 3622: Heap LSOM (1073738600/1073746792): Maximum allowed growth (8192) too small for size (53248)
2014-10-15T07:16:00.929Z cpu1:33546)WARNING: Heap: 4089: Heap_Align(LSOM, 49176/49176 bytes, 8 align) failed. caller: 0x418011c41756
2014-10-15T07:16:00.929Z cpu1:33546)WARNING: LSOM: LSOM_InitComponent:198: Cannot init commit flusher: Out of memory
2014-10-15T07:16:00.929Z cpu1:33546)LSOM: LSOMSSDEnumCb:210: Finished reading SSD Log: Out of memory
2014-10-15T07:16:01.601Z cpu0:32779)LSOM: LSOMRecoveryDispatch:2326: LLOG recovery complete 52448891-b3bb-3f6a-1d63-59a069d745ce:Recovered 1585904 entries, Processed 0 entries, Took 316394 ms
2014-10-15T07:16:01.625Z cpu0:32779)WARNING: LSOM: LSOMAddDiskGroupDispatch:5090: Failed to add disk group. SSD 52448891-b3bb-3f6a-1d63-59a069d745ce: Out of memory
2014-10-15T07:16:01.625Z cpu3:33174)WARNING: PLOG: PLOGNotifyDisks:2854: Notify disk group failed for SSD UUID 52448891-b3bb-3f6a-1d63-59a069d745ce :Out of memory was recovery complete ? No
2014-10-15T07:16:01.625Z cpu3:33174)PLOG: PLOG_Recover:518: Recovery on SSD naa.6001e4f01f124b001aa364c60cfb6035:2 had failed with Out of memory
2014-10-15T07:16:01.625Z cpu3:33174)WARNING: PLOG: PLOGRecoverDevice:4251: Recovery failed for disk group with SSD naa.6001e4f01f124b001aa364c60cfb6035
2014-10-15T07:16:01.625Z cpu3:33174)WARNING: PLOG: PLOGInitAndAnnounceMD:4167: Recovery failed for the disk group.. deferring publishing of magnetic disk naa.6001e4f01f124b001aa364e00e7a60bf
2014-10-15T07:16:01.625Z cpu3:33174)WARNING: PLOG: PLOGInitAndAnnounceMD:4167: Recovery failed for the disk group.. deferring publishing of magnetic disk naa.6001e4f01f124b001bbd549905ee5d3f
2014-10-15T07:16:13.965Z cpu2:34036)PLOG: PLOGAnnounceSSD:4071: Successfully added VSAN SSD (naa.6001e4f01f124b001aa364c60cfb6035:2) with UUID 52448891-b3bb-3f6a-1d63-59a069d745ce
2014-10-15T07:16:13.965Z cpu2:34036)PLOG: PLOGNotifyDisks:2805: MD 0 with UUID 527334fb-2f62-41ef-f067-f99caee21be8 with state 0 backing SSD 52448891-b3bb-3f6a-1d63-59a069d745ce notified
2014-10-15T07:16:13.965Z cpu2:34036)PLOG: PLOGNotifyDisks:2805: MD 1 with UUID 52852189-90e2-d9dd-e7a6-3b63a8510db6 with state 0 backing SSD 52448891-b3bb-3f6a-1d63-59a069d745ce notified
2014-10-15T07:16:13.965Z cpu2:34036)WARNING: PLOG: PLOGNotifyDisks:2831: Recovery on SSD 52448891-b3bb-3f6a-1d63-59a069d745ce had failed earlier, SSD not published
2014-10-15T07:16:13.965Z cpu2:34036)PLOG: PLOG_Recover:518: Recovery on SSD naa.6001e4f01f124b001aa364c60cfb6035:2 had failed with Out of memory
2014-10-15T07:16:13.965Z cpu2:34036)WARNING: PLOG: PLOGRecoverDevice:4251: Recovery failed for disk group with SSD naa.6001e4f01f124b001aa364c60cfb6035
2014-10-15T07:16:13.965Z cpu2:34036)WARNING: PLOG: PLOGInitAndAnnounceMD:4167: Recovery failed for the disk group.. deferring publishing of magnetic disk naa.6001e4f01f124b001aa364e00e7a60bf
2014-10-15T07:16:13.965Z cpu2:34036)WARNING: PLOG: PLOGInitAndAnnounceMD:4167: Recovery failed for the disk group.. deferring publishing of magnetic disk naa.6001e4f01f124b001bbd549905ee5d3f
2014-10-15T07:16:14.007Z cpu2:34036)Vol3: 714: Couldn't read volume header from naa.6001e4f01f124b001aa364e00e7a60bf:1: I/O error
2014-10-15T07:16:14.016Z cpu2:34036)Vol3: 714: Couldn't read volume header from naa.6001e4f01f124b001aa364e00e7a60bf:1: I/O error
2014-10-15T07:16:14.029Z cpu2:34036)FSS: 5051: No FS driver claimed device 'naa.6001e4f01f124b001aa364e00e7a60bf:1': Not supported
192.168.240.55
naa.6001e4f02024b1001aa354fc1821ac19
Device: naa.6001e4f02024b1001aa354fc1821ac19
Display Name: naa.6001e4f02024b1001aa354fc1821ac19
Is SSD: false
VSAN UUID: 52c0faf1-b242-78ab-6fbc-bdcb1d6c0b96
VSAN Disk Group UUID: 52f669f0-21c7-096f-2858-9142fc2ef315
VSAN Disk Group Name: naa.6001e4f02024b1001aa354b8140ff5be
Used by this host: true
In CMMDS: false
Checksum: 10210539749589974934
Checksum OK: true
naa.6001e4f02024b1001aa354b8140ff5be
Device: naa.6001e4f02024b1001aa354b8140ff5be
Display Name: naa.6001e4f02024b1001aa354b8140ff5be
Is SSD: true
VSAN UUID: 52f669f0-21c7-096f-2858-9142fc2ef315
VSAN Disk Group UUID: 52f669f0-21c7-096f-2858-9142fc2ef315
VSAN Disk Group Name: naa.6001e4f02024b1001aa354b8140ff5be
Used by this host: true
In CMMDS: true
Checksum: 3807632316314793651
Checksum OK: true
As you can see from all the logs and screenshots above, VSAN fails to initialize the Disk Group with the error: Out of memory
VMKERNEL
2014-10-15T07:16:09.189Z cpu2:33885)WARNING: PLOG: PLOGNotifyDisks:2831: Recovery on SSD 52448891-b3bb-3f6a-1d63-59a069d745ce had failed earlier, SSD not published
2014-10-15T07:16:09.189Z cpu2:33885)PLOG: PLOG_Recover:518: Recovery on SSD naa.6001e4f01f124b001aa364c60cfb6035:2 had failed with Out of memory
2014-10-15T07:16:09.189Z cpu2:33885)WARNING: PLOG: PLOGRecoverDevice:4251: Recovery failed for disk group with SSD naa.6001e4f01f124b001aa364c60cfb6035
2014-10-15T07:16:13.965Z cpu2:34036)LSOMCommon: SSDLOG_AddDisk:559: Existing ssd found naa.6001e4f01f124b001aa364c60cfb6035:2
2014-10-15T07:16:13.965Z cpu2:34036)PLOG: PLOGAnnounceSSD:4071: Successfully added VSAN SSD (naa.6001e4f01f124b001aa364c60cfb6035:2) with UUID 52448891-b3bb-3f6a-1d63-59a069d745ce
2014-10-15T07:16:13.965Z cpu2:34036)PLOG: PLOGNotifyDisks:2805: MD 0 with UUID 527334fb-2f62-41ef-f067-f99caee21be8 with state 0 backing SSD 52448891-b3bb-3f6a-1d63-59a069d745ce notified
2014-10-15T07:16:13.965Z cpu2:34036)PLOG: PLOGNotifyDisks:2805: MD 1 with UUID 52852189-90e2-d9dd-e7a6-3b63a8510db6 with state 0 backing SSD 52448891-b3bb-3f6a-1d63-59a069d745ce notified
2014-10-15T07:16:13.965Z cpu2:34036)WARNING: PLOG: PLOGNotifyDisks:2831: Recovery on SSD 52448891-b3bb-3f6a-1d63-59a069d745ce had failed earlier, SSD not published
2014-10-15T07:16:13.965Z cpu2:34036)PLOG: PLOG_Recover:518: Recovery on SSD naa.6001e4f01f124b001aa364c60cfb6035:2 had failed with Out of memory
2014-10-15T07:16:13.965Z cpu2:34036)WARNING: PLOG: PLOGRecoverDevice:4251: Recovery failed for disk group with SSD naa.6001e4f01f124b001aa364c60cfb6035
VSAN DISK GROUP FAILED WITH THE FOLLOWING ERROR FROM VMKERNEL
PLOG_Recover:518: Recovery on SSD naa.6001e4f01f124b001aa364c60cfb6035:2 had failed with Out of memory
Dear Simon,
I have updated 5 of the 6 hosts to 5.5 U2 so far.
The hosts status is the following:
192.168.240.50 --- Not updated yet, working on it now
192.168.240.51 --- was OK before the update and is NOT OK after the update; it hangs on boot
192.168.240.52 --- was OK before the update and is OK after the update
192.168.240.53 --- was OK before the update and is OK after the update
192.168.240.54 --- had a broken Disk Group with the memory error; after the update the error was resolved and the Disk Group worked
192.168.240.55 --- had a broken Disk Group with the memory error; after the update the same error continues to exist
Please find below the hang screen of 192.168.240.51:
Please note, the update finished successfully and this error occurred on reboot after the update; the host has now been hung here for 6 hours.
Please advise how to proceed with this!
Thanks !
Chris
I have just replied to you through the support request. If you can provide logs for 192.168.240.55, I will review them. This may be a situation where there is nothing further we can do; it may well come down to the behaviour of the uncertified hardware in this case, but I will review the logs and offer my findings through the SR.
Simon
Dear Simon,
I have now updated all 6 of the 6 hosts to 5.5 U2.
The hosts status is the following:
192.168.240.50 --- was OK before the update and is OK after the update; the Disk Group now works
192.168.240.51 --- was OK before the update and was NOT OK after the update (it hung on boot), but the problem got fixed after a reboot and the Disk Group now works
192.168.240.52 --- was OK before the update and is OK after the update; the Disk Group now works
192.168.240.53 --- was OK before the update and is OK after the update; the Disk Group now works
192.168.240.54 --- had a broken Disk Group with the memory error; after the update the error was resolved and the Disk Group worked
192.168.240.55 --- had a broken Disk Group with the memory error; after the update the same error continues to exist
Additionally, we tried updating the vCenter to 5.5 U2, but we have the following error:
Please advise how to proceed with this problem of updating vCenter!
Finally, we have also uploaded in FTP, the logs of the crashed server 192.168.240.55:
in the folder: "14544291010", on the file: "VMware-vCenter-support-2014-10-26@21-09-59.zip"
Thanks !
It looks like you are trying to "upgrade" rather than to "update" the vCenter Server Appliance. An upgrade with exchanging keys is only supported from a previous version. The way to update from any 5.5 build to the latest one is, e.g., to update through a direct Internet connection, or to use the "VMware-vCenter-Server-Appliance-5.5.0-....-updaterepo.iso" image from the UI's "Update" tab.
André
Dear Simon,
I have tried the command you gave me in the SR to increase the
heap size of the memory LSOM has allocated, then rebooted,
with no success on 192.168.240.55;
it still fails to bring up the Disk Group with the memory error...
Please find below some screenshots, as I find what is happening very strange
and I would like to know what is going wrong here.
As you mentioned earlier, the current setup is unsupported, but I was still not able
to find anything faulty with the hardware, and as it works on the remaining nodes
(and in fact the same issue recovered on node .54 after the 5.5 U2 update),
I see no reason why it fails on .55.
As you will notice here, the system reports that VSAN holds 3.4 TB of data,
which is correct, and my data is within this (it was the same size before the initial failure).
But the strange thing is that inside VSAN I can't see this data anywhere,
only a few GB of some ISO images I had...
Where is all this 3.4 TB of data, and why can't I access it???
Additionally, it does not make much sense how it is possible for VSAN to suffer this failure and data loss,
as the policy set before the failure was to keep the data on 4 nodes, in 2 RAID-1 mirrors,
each consisting of a 2-component RAID-0 stripe...
So, normally, with just one node down / inaccessible, only 1 of the 4 nodes is missing, which means
just one RAID-0 component is gone; the other mirror of the RAID-1 and half of the stripe are OK, and it totally
makes no sense to me why I cannot access the data...
Any ideas ???
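To sanity-check the reasoning above: with one mirror and a stripe width of 2, an object has two RAID-0 legs of two components each, and it stays readable as long as at least one leg is fully intact. A toy model of that layout (host placement is hypothetical, just to illustrate the expectation that one failed host alone should not make the object inaccessible):

```python
# Toy model: an object is a RAID-1 of two RAID-0 legs, each leg
# striped across two hosts (the 4-node policy described above).
layout = {
    "leg1": ["esxi02", "esxi03"],   # hypothetical placement
    "leg2": ["esxi04", "esxi05"],
}

def object_accessible(layout, failed_hosts):
    """RAID-1 over RAID-0: readable if any leg has no failed host."""
    return any(all(h not in failed_hosts for h in hosts)
               for hosts in layout.values())

print(object_accessible(layout, {"esxi05"}))            # True: leg1 intact
print(object_accessible(layout, {"esxi03", "esxi05"}))  # False: both legs hit
```

Note this toy model ignores VSAN's witness components and quorum rule (more than 50% of an object's components must be available), which in practice can leave an object inaccessible even when a full mirror survives; that may be part of the gap between the expectation above and what actually happened.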
Additionally, I have run "vsan.support_information" and sent a detailed log via FTP,
plus info about it via the SR, so please check this and let me know what can be done...
Thanks in advance for any help on this; I hope there is a way to save my data...
Additionally,
I also noticed there is a hidden option in vsish to enableRecovery of LSOM;
would this be of any help, and what does it do???
Any news about this ?
Thanks !
Further bump, as I am interested in the outcome of this. Has there been movement on it?
Simon sent Chris an additional command to run on November 5 and is waiting for the outcome.
The ball is in Chris's court right now 🙂
Thanks
Unfortunately, even after using a private command (provided by VSAN engineers)
that was supposed to persistently increase the VMkernel heap memory size, in order
to give the rebuild / replay of the SSD log a chance, we had no luck with it, and even
with several reboots, the last node did not bring the disk group back...
As if all this mess were not enough, after all these months of trying to rescue
our data, yesterday 2 drives failed on the VSAN, making it impossible to proceed further.
I do not know whether this occurred due to extreme stress during rebuilds,
whether the controller caused it, whether the private command stressed the drives
more than it should have, or whether it was simply "their time" to die (not reasonable
for 3-month-old disks), but in any case the result is the same: we lost valuable
client data that cannot be replaced, and its value / overall loss for us exceeds 200,000 USD...
I wish it were possible to bring back the data using only 5 of the 6 nodes
and simply drop the 6th, failed node. But as it seems, VSAN acted strangely
and prevented this: when the initial networking issue with the switches occurred,
the nodes started becoming unavailable sequentially from .50 to .55, while
VSAN was attempting in the background to move all the data to the remaining
nodes. This resulted in most of the data chunks accumulating on the last
node (.55), so having it unavailable prevents us from getting our data back
using only the remaining nodes.
In my opinion, this is not a smart way to handle failures. If VSAN had detected
that the actual problem (unavailable / foreign disks getting kicked out of the cluster)
was NOT an actual hardware failure but just a networking issue, it would have
prevented the automated rebuilding. That rebuilding may be useful for small-scale
failures, but it seems to bring a huge mess and data loss when the failures are of this scale,
which, by the way, is practically impossible to happen for real (all hardware at once);
only a networking issue could cause it.
Finally, I can't believe this is related to incompatible hardware, or anything
hardware-related at all, as it worked just fine for the first 30 days of evaluation.
Thanks Chris for the update.
VSAN did what it was supposed to do by starting the resync operations after a timeout expired following the loss of network connectivity. If the network had been configured with redundant connections to avoid single points of failure, the network partitioning situation would not have existed. However, I am not certain of your network design, so I cannot provide specific feedback to correct it.
In addition, you were using beta code when you created the VSAN. The beta on-disk layout and size is different from the GA release and may have contributed to using up the heap memory regardless of the changes we suggested.
The problem is that using an adapter and SSD not on the HCL may have caused the following:
1. The SSD is unable to meet the space, durability and performance requirements for heavy resync operations. If the ratio of SSD to HDD capacity is too low (the recommendation is 10%), the SSD would not cope with the resync workload even if it were on the HCL.
2. The adapter queue depth may be too small to meet the load caused by heavy resync operations. Certified adapters are required to have a minimum queue depth value in order to pass.
3. The adapter firmware may not be configured optimally for VSAN operations and workload. The certification process identifies such issues and would have resulted in the vendor providing updated firmware to address the required optimization.
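To make point 1 concrete, the 10% ratio mentioned above translates into a simple sizing check (the 3.4 TB figure is taken from earlier in this thread; the helper name is hypothetical):

```python
def recommended_flash_gb(consumed_capacity_gb, flash_ratio=0.10):
    """Flash sizing rule of thumb: ~10% of the anticipated consumed capacity."""
    return consumed_capacity_gb * flash_ratio

# e.g. for the ~3.4 TB of data mentioned earlier in this thread:
print(recommended_flash_gb(3400))   # ~340 GB of flash across the cluster
```

With one SSD per host on six hosts, each SSD would need to carry its share of that flash capacity; falling well short of it is exactly the "SSD cannot cope with the resync workload" situation described above.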
As for the 2 HDDs failing at the same time: they would have failed later regardless; it was just a matter of time for these specific 2 disks. Sometimes disks from the same batch have the same defects or lifespan limitations.
Using disks listed on the HCL simply represents higher confidence in their performance, stability and firmware compatibility.
I sympathize with you on the loss of data. We gave it a try, though. If the underlying configuration had been reviewed by VMware while you were in the proof-of-concept phase, we would have alerted you to the unsupported hardware, and possibly to the network points of failure if identified.
Customers should consult the VMware HCL before committing to a hardware purchase, regardless of the planned solution. VMware cannot predict the stability, compatibility or performance of hardware components that are not certified.
I hope you will give VSAN another try using certified hardware, following the recommended best practices for designing and configuring the solution.
Hello,
I read the whole thread and am left with mixed feelings about VSAN, as it is not clear to me whether this was ultimately a VSAN bug or something the user did wrong.
In any case, I have also noticed some things that annoy me:
1. VMware staff state that the version used is unsupported because it is BETA code, which is totally false.
In fact, I have also played with VSAN in our labs and remember downloading the 5.5 version, in which VSAN
was an integrated part, and nowhere did it say it was a beta: not at download, not at installation, and not in any KB at that time.
This KB, and generally the "renaming" of the release to beta, happened a long time after people got it and installed it,
leaving all of us thinking that something went terribly wrong with VSAN and VMware had to avoid liability for it,
so they simply made it a beta (with a discounted price!), while it was not initially planned as a beta.
2. VMware staff also seem to drop the ball regarding what the underlying problem really is, and blame hardware
incompatibility, while it was 100% a software problem.
I would feel very bad and exposed to my clients if I had advised them to purchase this software for many thousands of dollars and then faced such an unfair "game" here.
I don't believe anything would be done differently for a paying client, as the intentions here are clear: to avoid the liability while you already know it is your software's fault.
3. The time it took to handle the issue (1 month???) also seems unacceptable to me; it seems that if the techs had
handled it properly and in a timely manner, the drives would not have failed under all these tests and all this stress.
4. Even the last response says that VSAN did what it was expected to do. Wrong: software must be made to serve us,
not to kill our data through an inability to detect such big-time failures.
When a network failure affects all nodes, it does not make any sense to follow the standard procedure and try to rebuild;
it is clearly bad coding practice to be unable to detect something that can totally destroy all of a user's data...
So, what is the conclusion here?
Is VSAN something we can rely on for our data?
Or is it still too early to take the risk, and might it be better to let others take the hit and check again in a couple of years???
ryanwls wrote:
In any case, I have also noticed some things that annoy me: 1. VMware staff, does state that the version used, is unsupported as it is using BETA code, which is totally false.
In fact, I have also played in our labs with vsan and remember to have downloaded 5.5 version, in which vsan was an integrated part of it and there was
no place it said it is beta at download or installation or kb at that time.
This KB and generally the "renaming" of the release to beta, occured long time after people got it and installed it, leaving all us thinking, that something
went terribly wrong with vsan and vmware had to avoid liabilities for this, so they simply made this beta (with discounted price!), while it was not initially
a planned beta.
Hi,
VSAN was released for production in vSphere 5.5 U1.
After the release of vSphere 5.5, VMware released a special beta build for the VSAN public beta.
This beta version was used by the topic starter.
Additionally, the topic starter placed production data on beta software, without a support contract and without a proper backup.
The missing backup especially is a very big mistake.
Could not disagree more. This is clearly a vSAN implementation issue. I hate to pile on a fellow IT worker who is in a bad spot. 1.) It was a BETA version. 2.) It was an eval version with no support; I think it was great that VMware opened an SR and tried to help. 3.) The key pieces of hardware (SSD and controller) were not on the HCL. 4.) No vSAN networking redundancy. 5.) And worst, no backups of the data.
Seems clear to me.
crosdorff wrote:
ryanwls wrote:
In any case, I have also noticed some things that annoy me: 1. VMware staff, does state that the version used, is unsupported as it is using BETA code, which is totally false.
In fact, I have also played in our labs with vsan and remember to have downloaded 5.5 version, in which vsan was an integrated part of it and there was
no place it said it is beta at download or installation or kb at that time.
This KB and generally the "renaming" of the release to beta, occured long time after people got it and installed it, leaving all us thinking, that something
went terribly wrong with vsan and vmware had to avoid liabilities for this, so they simply made this beta (with discounted price!), while it was not initially
a planned beta.
Hi,
Vsan was released for production in vSphere 5.5U1.
After the release of vSphere 5.5 VMware released a special Beta build for the vSan public beta.
This beta version was used by the topic starter.
Additionally the topic starter placed production data on beta software, without having a support contract and without a proper backup.
Especially the missing backup is a very big mistake.
First of all, I must say that I agree having no backup plan was a total failure here, as the original poster also acknowledged at the beginning.
But on the other hand, I have to disagree with you about the beta build.
I have a copy of vSphere 5.5 (before U1) which does have VSAN, and there is no indication anywhere in the product, during installation or afterwards in the GUI,
that it is in fact a beta.
It gives you the normal 60 days to evaluate by default after the install completes, and it also accepts a valid (paid) license key, which is not reasonable for a beta, right?
Additionally, I cannot find anything indicating it was initially released as a beta at release time; only after the release date,
when there seemed to be a huge problem with clients losing data, did VMware mark it as beta.
zdickinson wrote:
Could not disagree more. This is clearly an implementation of vSAN issue. I hate to pile on a fellow IT worker who is in a bad spot. 1.)It was a BETA version. 2.)It was an eval version with no support. I think it was great VMware opened an SR and tried to help. 3.)The key pieces of hardware (SSD and controller) were not on the HCL. 4.)No vSAN networking redundancy. 5.)And worst, no backups of the data.
Seems clear to me.
I agree with you that it is not good practice to pile on a fellow IT worker, but being in a bad spot is the result of his work and its results (or lack of them).
1. About the continuous flagging of the product as BETA: as I already mentioned in post 35, this is not true, at least not inside the product, only in press releases and KBs opened AFTER the actual release date.
2. I agree it was great of VMware to open an SR and try to help, but losing data due to a bug in their software is surely a big problem. Support is for engineers, bugs are for developers; the failure here seems clear to me.
3. The HCL is always the way I would choose to go, but still, I have never had trouble using VMware products on unsupported hardware (only for lab tests). Not everyone can afford the latest technology, and we all know that the costs involved, with all the competition in the hosting field, do not leave big margins for it.
4. It is unclear whether there was networking redundancy in this case, but in my lab tests I have run into trouble even on stacked switches.
5. Couldn't agree more; backups are always the key.
First of all, VSAN is a product that uses vSphere as a delivery vehicle.
I agree with everyone that it is confusing that vSphere 5.5 GA, which already includes the VSAN feature, actually contains VSAN beta code.
The problem is that VMware needed a longer beta cycle for VSAN, which extended beyond the planned vSphere 5.5 release date. So the actual GA code of VSAN was released in 5.5 Update 1, after the beta cycle concluded.
So the bug reported in this post is actually a bug in the VSAN beta code, which was fixed in the VSAN GA code via vSphere 5.5 Update 1. (Confused yet?)
Please note that upgrading the software components to 5.5 Update 1 or later AFTER the VSAN cluster was created with 5.5 GA DOES NOT result in a supported configuration. The reason is that the on-disk format of the VSAN datastore was changed in the VSAN GA code (in Update 1 or later), so a datastore created on the beta will still have some of the bugs that were fixed after the beta.
I am currently reviewing all VSAN-related documents to make sure they state 5.5 Update 1 or later as the minimum software requirement.
As for the HCL: even if a product that is not listed there works in your test/lab environment, it does not mean that it meets the certification guidelines or the required endurance metrics.
For example, using an HBA that is not certified, you may end up with a very small queue depth. So, while things may work well for a while, once you need to resync a large number of objects, the command queue may fill quickly and operations will grind to a halt.
Another example is using an SSD that is not on the HCL. Such an SSD may not meet the minimum performance or endurance requirements and may die much sooner than certified SSDs. Certain bug fixes and performance functions are also only included in certified SSD firmware versions; using older versions may result in severe performance degradation and instability of the environment.
There are reasons for having the HCL in place, and they do not include inconveniencing the customer. Rather, it is there for customers' protection, making sure the certified components meet reasonable performance and endurance metrics.
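A crude way to see the queue-depth point is to model commands arriving faster than they complete: once outstanding I/Os exceed the adapter queue depth, new commands are rejected (QFULL) and must be retried. A toy simulation (all numbers hypothetical, not measured from any real adapter):

```python
def simulate_qfull(arrivals_per_tick, completions_per_tick, queue_depth, ticks):
    """Count I/Os rejected (QFULL) when the adapter queue overflows."""
    outstanding = rejected = 0
    for _ in range(ticks):
        # I/Os already in the queue complete first...
        outstanding = max(0, outstanding - completions_per_tick)
        # ...then new resync I/Os arrive and try to enter the queue.
        for _ in range(arrivals_per_tick):
            if outstanding < queue_depth:
                outstanding += 1
            else:
                rejected += 1
    return rejected

# Same sustained resync burst against a shallow vs a deep adapter queue:
shallow = simulate_qfull(50, 40, queue_depth=25, ticks=100)
deep = simulate_qfull(50, 40, queue_depth=600, ticks=100)
print(shallow, deep)   # the shallow queue rejects far more I/Os
```

The deep queue still overflows eventually under a sustained overload, but it absorbs the burst for much longer, which is the practical difference between a certified HBA and a desktop-class one during a large resync.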