soroshsabz
Contributor

VSAN: Initialization for SSD Failed

ITNOA

Hi all,

As I asked in https://serverfault.com/q/922056/145811,

I have two SuperMicro servers. I installed `ESXi 6.0.0` on both of them, created a vSAN cluster with them, and placed all VMs on `vsanStorage`. Each server has two SSDs in RAID 1 and two HDDs in RAID 1. After a power failure in my data center, all VMs on one server became orphaned and all VMs on the other server became inaccessible. After some investigation of the problem, I found that one of my servers could not initialize vSAN and logged many errors like the ones below:


    865)CMMDS: MasterAddNodeToMembership:4982: Added node 5777c24c-2568-7ec6-4dd8-005056bb8703 to the cluster membership
    0:07:29.240Z cpu27:34329)VSAN Device Monitor: Checking VSAN device latencies and congestion.
    519)ScsiDeviceIO: 2651: Cmd(0x439e17f1ca00) 0x1a, CmdSN 0x1 from world 34314 to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    519)ScsiDeviceIO: 2651: Cmd(0x439e17f1ca00) 0x1a, CmdSN 0x2 from world 34314 to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    519)ScsiDeviceIO: 2651: Cmd(0x439e17f1ca00) 0x1a, CmdSN 0x3 from world 34314 to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    519)ScsiDeviceIO: 2651: Cmd(0x439e17f1ca00) 0x1a, CmdSN 0x4 from world 34314 to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    519)ScsiDeviceIO: 2651: Cmd(0x439e17f1ca00) 0x1a, CmdSN 0x5 from world 34314 to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    519)ScsiDeviceIO: 2651: Cmd(0x439e17f1ca00) 0x1a, CmdSN 0x6 from world 34314 to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    4357)Tracing: dropped 707185 traces (707185 total)
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a780) 0x1a, CmdSN 0x6bf from world 0 to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a780) 0x1a, CmdSN 0x6c4 from world 0 to dev "naa.600304801cb841001f08f209107cfabe" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a780) 0x1a, CmdSN 0x6ca from world 0 to dev "naa.600304801cb841001f08f22c1296cd81" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a780) 0x1a, CmdSN 0x6d0 from world 0 to dev "naa.600304801cb841001f08f19809c8d99a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a780) 0x1a, CmdSN 0x6d5 from world 0 to dev "naa.600304801cb841001f08f19809c8d99a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a780) 0x1a, CmdSN 0x6da from world 0 to dev "naa.600304801cb841001f08f19809c8d99a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    3520)NMP: nmp_ThrottleLogForDevice:3231: last error status from device naa.600304801cb841001f08f19809c8d99a repeated 80 times
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a780) 0x1a, CmdSN 0x6df from world 0 to dev "naa.600304801cb841001f08f19809c8d99a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a780) 0x1a, CmdSN 0x6e4 from world 0 to dev "naa.600304801cb841001f08f19809c8d99a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a780) 0x1a, CmdSN 0x6e9 from world 0 to dev "naa.600304801cb841001f08f22c1296cd81" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    4465)PLOG: PLOGProbeDevice:5213: Probed plog device <naa.600304801cb841001f08f22c1296cd81:1> 0x4305394dd770 exists.. continue with old entry
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a600) 0x1a, CmdSN 0x6ef from world 0 to dev "naa.600304801cb841001f08f209107cfabe" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a600) 0x1a, CmdSN 0x6f5 from world 0 to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    4465)PLOG: PLOGProbeDevice:5213: Probed plog device <naa.600304801cb841001f08f1ce0cfa04ce:1> 0x4305390d9630 exists.. continue with old entry
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a480) 0x1a, CmdSN 0x6fa from world 0 to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    4465)PLOG: PLOGProbeDevice:5213: Probed plog device <naa.600304801cb841001f08f1ce0cfa04ce:2> 0x4305390da670 exists.. continue with old entry
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a480) 0x1a, CmdSN 0x6ff from world 0 to dev "naa.600304801cb841001f08f22c1296cd81" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    4465)PLOG: PLOGProbeDevice:5213: Probed plog device <naa.600304801cb841001f08f22c1296cd81:2> 0x4305394de7b0 exists.. continue with old entry
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a480) 0x1a, CmdSN 0x705 from world 0 to dev "naa.600304801cb841001f08f19809c8d99a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
    4465)WARNING: LSOMCommon: LSOM_DiskGroupCreate:1448: Disk group already created uuid: 521ae5f3-eac3-cfa7-e10d-01b2f379762c
    4465)LSOMCommon: SSDLOG_AddDisk:723: Existing ssd found naa.600304801cb841001f08f1ce0cfa04ce:2
    4465)PLOG: PLOGAnnounceSSD:6570: Successfully added VSAN SSD (naa.600304801cb841001f08f1ce0cfa04ce:2) with UUID 521ae5f3-eac3-cfa7-e10d-01b2f379762c
    4465)VSAN: Initializing SSD: 521ae5f3-eac3-cfa7-e10d-01b2f379762c Please wait...
    2959)PLOG: PLOGNotifyDisks:4010: MD 0 with UUID 52f0ac26-c7b0-8f0f-6dbb-3aeddcae32f2 with state 0 formatVersion 4 backing SSD 521ae5f3-eac3-cfa7-e10d-01b2f379762c notified
    2959)WARNING: PLOG: PLOGNotifyDisks:4036: Recovery on SSD 521ae5f3-eac3-cfa7-e10d-01b2f379762c had failed earlier, SSD not published
    2959)WARNING: PLOG: PLOGRecoverDeviceLogsDispatch:4220: Error Failure from PLOGNotifyDisks() for SSD naa.600304801cb841001f08f1ce0cfa04ce
    4465)WARNING: PLOG: PLOGCheckRecoveryStatusForOneDevice:6682: Recovery failed for disk 521ae5f3-eac3-cfa7-e10d-01b2f379762c
    4465)VSAN: Initialization for SSD: 521ae5f3-eac3-cfa7-e10d-01b2f379762c Failed
    4465)WARNING: PLOG: PLOGInitAndAnnounceMD:6901: Recovery failed for the disk group.. deferring publishing of magnetic disk naa.600304801cb841001f08f22c1296cd81
    3520)ScsiDeviceIO: 2651: Cmd(0x43a580c2a480) 0x1a, CmdSN 0x70a from world 0 to dev "naa.600304801cb841001f08f19809c8d99a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
   
   
   
   
   
   
   
   
    2018-07-15T21:56:58.882Z cpu25:33315)ScsiDeviceIO: 8409: Get VPD 86 Inquiry for device "naa.600304801cb841001f08f22c1296cd81" from Plugin "NMP" failed. Not supported
    2018-07-15T21:56:58.882Z cpu25:33315)ScsiDeviceIO: 7030: Could not detect setting of QErr for device naa.600304801cb841001f08f22c1296cd81. Error Not supported.
    2018-07-15T21:56:58.882Z cpu25:33315)ScsiDeviceIO: 7544: Could not detect setting of sitpua for device naa.600304801cb841001f08f22c1296cd81. Error Not supported.
    2018-07-15T21:56:58.883Z cpu32:33526)ScsiDeviceIO: 2636: Cmd(0x43bd80c5edc0) 0x1a, CmdSN 0x9 from world 0 to dev "naa.600304801cb841001f08f22c1296cd81" failed H:0x0 D:0x2 P:0x0 Valid
    2018-07-15T21:56:58.883Z cpu25:33315)ScsiEvents: 300: EventSubsystem: Device Events, Event Mask: 40, Parameter: 0x4302972eff40, Registered!
    2018-07-15T21:56:58.883Z cpu25:33315)ScsiDevice: 3905: Successfully registered device "naa.600304801cb841001f08f22c1296cd81" from plugin "NMP" of type 0
   
   
    2018-07-15T21:57:09.321Z cpu20:33315)PLOG: PLOG_InitDevice:262: Initialized device naa.600304801cb841001f08f22c1296cd81:2 0x4305644ed110 quiesceTask 0x4305644ee150 on SSD 00000000-00
    2018-07-15T21:57:09.322Z cpu20:33315)PLOG: PLOG_InitDevice:262: Initialized device naa.600304801cb841001f08f1ce0cfa04ce:2 0x4305644ef770 quiesceTask 0x4305644ee620 on SSD 00000000-00
    2018-07-15T21:57:09.323Z cpu20:33315)VSANServer: VSANServer_InstantiateServer:2885: Instantiated VSANServer 0x4305644eeb58
    2018-07-15T21:57:09.323Z cpu20:33315)PLOG: PLOG_InitDevice:262: Initialized device naa.600304801cb841001f08f1ce0cfa04ce:1 0x4305644f07b0 quiesceTask 0x4305644f17f0 on SSD 521ae5f3-ea
    2018-07-15T21:57:09.323Z cpu20:33315)PLOG: PLOG_InitDevice:262: Initialized device naa.600304801cb841001f08f1ce0cfa04ce:2 0x4305644f1c70 quiesceTask 0x4305644f2cb0 on SSD 521ae5f3-ea
    2018-07-15T21:57:09.323Z cpu20:33315)PLOG: PLOG_FreeDevice:325: PLOG in-mem device 0x4305644ef770 naa.600304801cb841001f08f1ce0cfa04ce:2 0x1 00000000-0000-0000-0000-000000000000 is b
    2018-07-15T21:57:09.323Z cpu20:33315)PLOG: PLOG_FreeDevice:496: Throttled: Waiting for ops to complete on device: 0x4305644ef770 naa.600304801cb841001f08f1ce0cfa04ce:2
    2018-07-15T21:57:09.336Z cpu20:33315)PLOG: PLOGCreateGroupDevice:592: Allocated 65536 trace entries for 521ae5f3-eac3-cfa7-e10d-01b2f379762c
    2018-07-15T21:57:09.336Z cpu20:33315)PLOG: PLOGCreateGroupDevice:611: PLOG disk group for SSD 0x4305644f07b0 521ae5f3-eac3-cfa7-e10d-01b2f379762c is created
    2018-07-15T21:57:09.337Z cpu20:33315)PLOG: PLOG_InitDevice:262: Initialized device naa.600304801cb841001f08f22c1296cd81:1 0x4305644ef770 quiesceTask 0x4305648f5120 on SSD 521ae5f3-ea
    2018-07-15T21:57:09.337Z cpu20:33315)PLOG: PLOG_InitDevice:262: Initialized device naa.600304801cb841001f08f22c1296cd81:2 0x4305648f55a0 quiesceTask 0x4305648f65e0 on SSD 521ae5f3-ea
    2018-07-15T21:57:09.337Z cpu20:33315)PLOG: PLOG_FreeDevice:325: PLOG in-mem device 0x4305644ed110 naa.600304801cb841001f08f22c1296cd81:2 0x1 00000000-0000-0000-0000-000000000000 is b
    2018-07-15T21:57:09.350Z cpu20:33315)LSOMCommon: LSOM_DiskGroupCreate:1461: Creating diskgroup uuid: 521ae5f3-eac3-cfa7-e10d-01b2f379762c (Read cache size: 207773478912, Write buffer
    2018-07-15T21:57:09.350Z cpu20:33315)LSOMCommon: LSOMGlobalMemInit:1257: Initializing LSOM's global memory
   
   
    2018-07-15T21:57:25.776Z cpu30:32970)PLOG: PLOG_Recover:882: Doing plog recovery on SSD naa.600304801cb841001f08f1ce0cfa04ce:2
    2018-07-15T21:57:26.168Z cpu6:33577)Created VSAN Slab PLOGRecovSlab_0x4305644f1c70 (objSize=40960 align=64 minObj=32769 maxObj=32769 overheadObj=1310 minMemUsage=1499476k maxMemUsage
    2018-07-15T21:57:26.184Z cpu10:33562)PLOG: PLOGHandleLogEntry:320: Recovering SSD state for MD 52f0ac26-c7b0-8f0f-6dbb-3aeddcae32f2
    2018-07-15T21:58:39.226Z cpu0:33525)WARNING: LSOMCommon: SSDLOG_EnumLogCB:1450: SSD corruption detected. device: naa.600304801cb841001f08f1ce0cfa04ce:2
    2018-07-15T21:58:39.226Z cpu10:33562)WARNING: PLOG: PLOGEnumLogCB:411: Log enum CB failed with Corrupt RedoLog
    2018-07-15T21:58:39.226Z cpu10:33562)LSOMCommon: SSDLOG_EnumLogHelper:1401: Throttled: Waiting for 1 outstanding reads
    2018-07-15T21:58:39.226Z cpu0:33525)LSOMCommon: SSDLOG_IsValidLogBlk:132: Invalid version device: naa.600304801cb841001f08f1ce0cfa04ce:2
    2018-07-15T21:58:39.226Z cpu0:33525)WARNING: LSOMCommon: SSDLOG_EnumLogCB:1450: SSD corruption detected. device: naa.600304801cb841001f08f1ce0cfa04ce:2
    2018-07-15T21:58:39.337Z cpu7:33578)Destroyed VSAN Slab PLOGRecovSlab_0x4305644f1c70 (maxCount=32769 failCount=0)
    2018-07-15T21:58:39.337Z cpu22:33742)PLOG: PLOGRecDisp:823: PLOG recovery complete 521ae5f3-eac3-cfa7-e10d-01b2f379762c:Processed 2271342 entries, Took 73154 ms
    2018-07-15T21:58:39.337Z cpu22:33742)PLOG: PLOGRecDisp:832: Recovery for naa.600304801cb841001f08f1ce0cfa04ce:2 completed with Corrupt RedoLog
    2018-07-15T21:58:39.337Z cpu37:33315)WARNING: PLOG: PLOGCheckRecoveryStatusForOneDevice:6702: Recovery failed for disk 521ae5f3-eac3-cfa7-e10d-01b2f379762c
    2018-07-15T21:58:39.337Z cpu37:33315)VSAN: Initialization for SSD: 521ae5f3-eac3-cfa7-e10d-01b2f379762c Failed
    2018-07-15T21:58:39.337Z cpu37:33315)WARNING: PLOG: PLOGInitAndAnnounceMD:6921: Recovery failed for the disk group.. deferring publishing of magnetic disk naa.600304801cb841001f08f22
    2018-07-15T21:58:39.371Z cpu37:33315)Vol3: 2687: Could not open device 'naa.600304801cb841001f08f1ce0cfa04ce:2' for probing: No underlying device for major,minor
    2018-07-15T21:58:39.372Z cpu37:33315)Vol3: 2687: Could not open device 'naa.600304801cb841001f08f1ce0cfa04ce:2' for probing: No underlying device for major,minor
    2018-07-15T21:58:39.374Z cpu37:33315)Vol3: 1078: Could not open device 'naa.600304801cb841001f08f1ce0cfa04ce:2' for volume open: No underlying device for major,minor
    2018-07-15T21:58:39.375Z cpu37:33315)Vol3: 1078: Could not open device 'naa.600304801cb841001f08f1ce0cfa04ce:2' for volume open: No underlying device for major,minor
    2018-07-15T21:58:39.375Z cpu37:33315)FSS: 5353: No FS driver claimed device 'naa.600304801cb841001f08f1ce0cfa04ce:2': No underlying device for major,minor
    2018-07-15T21:58:39.376Z cpu37:33315)Vol3: 1023: Couldn't read volume header from : I/O error
    2018-07-15T21:58:39.377Z cpu37:33315)Vol3: 1023: Couldn't read volume header from : I/O error
    2018-07-15T21:58:39.380Z cpu37:33315)Vol3: 1023: Couldn't read volume header from naa.600304801cb841001f08f22c1296cd81:1: I/O error
    2018-07-15T21:58:39.381Z cpu37:33315)Vol3: 1023: Couldn't read volume header from naa.600304801cb841001f08f22c1296cd81:1: I/O error
    2018-07-15T21:58:39.381Z cpu37:33315)FSS: 5353: No FS driver claimed device 'naa.600304801cb841001f08f22c1296cd81:1': No filesystem on the device
    2018-07-15T21:58:39.386Z cpu32:33526)ScsiDeviceIO: 2636: Cmd(0x43bd80c20b80) 0x1a, CmdSN 0x147 from world 0 to dev "naa.600304801cb841001f08f19809c8d99a" failed H:0x0 D:0x2 P:0x0 Val
    2018-07-15T21:58:39.399Z cpu37:33315)Vol3: 2687: Could not open device 'naa.600304801cb841001f08f1ce0cfa04ce:1' for probing: No underlying device for major,minor
    2018-07-15T21:58:39.400Z cpu37:33315)Vol3: 2687: Could not open device 'naa.600304801cb841001f08f1ce0cfa04ce:1' for probing: No underlying device for major,minor
    2018-07-15T21:58:39.401Z cpu32:33526)ScsiDeviceIO: 2636: Cmd(0x43bd80c1da80) 0x1a, CmdSN 0x19c from world 0 to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 Val
    2018-07-15T21:58:39.402Z cpu37:33315)Vol3: 1078: Could not open device 'naa.600304801cb841001f08f1ce0cfa04ce:1' for volume open: No underlying device for major,minor
    2018-07-15T21:58:39.403Z cpu37:33315)Vol3: 1078: Could not open device 'naa.600304801cb841001f08f1ce0cfa04ce:1' for volume open: No underlying device for major,minor
    2018-07-15T21:58:39.403Z cpu37:33315)FSS: 5353: No FS driver claimed device 'naa.600304801cb841001f08f1ce0cfa04ce:1': No underlying device for major,minor
    2018-07-15T21:58:39.404Z cpu37:33315)VC: 3551: Device rescan time 90053 msec (total number of devices 7)
    2018-07-15T21:58:39.404Z cpu37:33315)VC: 3554: Filesystem probe time 35 msec (devices probed 7 of 7)
    2018-07-15T21:58:39.404Z cpu37:33315)VC: 3556: Refresh open volume time 0 msec
   
   
    2018-07-15T21:58:46.797Z cpu32:33315)WARNING: MemSched: 15593: Group vsanperfsvc: Requested memory limit 0 KB insufficient to support effective reservation 22436 KB
    2018-07-15T21:58:46.797Z cpu32:33315)ALERT: Unable to restore Resource Pool settings for host/vim/vmvisor/vsanperfsvc. It is possible hardware or memory constraints have changed. Ple
    [the MemSched warning and Resource Pool alert above repeat several more times]
    2018-07-15T21:58:46.836Z cpu18:34102)Loading module vmkapei ...
   
   
   
   
   
    2018-07-15T21:58:51.789Z cpu10:34486)WARNING: lsi_mr3: mfi_Discover:339: Physical disk vmhba2:C0:T0:L0 hidden from upper layer.
    2018-07-15T21:58:51.789Z cpu10:34486)WARNING: ScsiScan: 1651: Failed to add path vmhba2:C0:T0:L0 : No connection
    2018-07-15T21:58:51.789Z cpu10:34486)WARNING: lsi_mr3: mfi_Discover:339: Physical disk vmhba2:C0:T1:L0 hidden from upper layer.
    2018-07-15T21:58:51.789Z cpu10:34486)WARNING: ScsiScan: 1651: Failed to add path vmhba2:C0:T1:L0 : No connection
    2018-07-15T21:58:51.789Z cpu10:34486)WARNING: lsi_mr3: mfi_Discover:339: Physical disk vmhba2:C0:T2:L0 hidden from upper layer.
    2018-07-15T21:58:51.789Z cpu10:34486)WARNING: ScsiScan: 1651: Failed to add path vmhba2:C0:T2:L0 : No connection
    2018-07-15T21:58:51.789Z cpu10:34486)WARNING: lsi_mr3: mfi_Discover:339: Physical disk vmhba2:C0:T3:L0 hidden from upper layer.
    2018-07-15T21:58:51.789Z cpu10:34486)WARNING: ScsiScan: 1651: Failed to add path vmhba2:C0:T3:L0 : No connection
    2018-07-15T21:58:52.346Z cpu4:34694)Config: 681: "SIOControlFlag1" = 0, Old Value: 0, (Status: 0x0)
    2018-07-15T21:58:52.774Z cpu14:34849)VisorFSRam: 700: hostdstats with (0,1303,0,0,755)
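
For reference, the repeated `ScsiDeviceIO` failures in the log above can be decoded using the standard SCSI status and sense codes (`H` = host status, `D` = device status, `P` = plugin status). A minimal sketch below decodes one of those lines; the code tables contain only the well-known SCSI constants that actually appear in this log, and the regex is just an illustration of the log format, not an official parser:

```python
import re

# Well-known SCSI codes (from the SPC standard); only values seen in this log.
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x24, 0x00): "INVALID FIELD IN CDB"}
OPCODES = {0x1A: "MODE SENSE(6)"}

LOG_RE = re.compile(
    r"Cmd\(0x[0-9a-f]+\) (0x[0-9a-f]+),.* failed "
    r"H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+) "
    r"Valid sense data: (0x[0-9a-f]+) (0x[0-9a-f]+) (0x[0-9a-f]+)"
)

def decode(log_line: str) -> str:
    """Decode the opcode, device status and sense triple from a vmkernel line."""
    m = LOG_RE.search(log_line)
    if not m:
        return "no SCSI failure found"
    op, host, dev, plug, key, asc, ascq = (int(x, 16) for x in m.groups())
    return (f"{OPCODES.get(op, hex(op))} failed: "
            f"device status {DEVICE_STATUS.get(dev, hex(dev))}, "
            f"sense {SENSE_KEYS.get(key, hex(key))} / "
            f"{ASC_ASCQ.get((asc, ascq), (hex(asc), hex(ascq)))}")

line = ('ScsiDeviceIO: 2651: Cmd(0x439e17f1ca00) 0x1a, CmdSN 0x1 from world 34314 '
        'to dev "naa.600304801cb841001f08f1ce0cfa04ce" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x5 0x24 0x0.')
print(decode(line))
# MODE SENSE(6) failed: device status CHECK CONDITION, sense ILLEGAL REQUEST / INVALID FIELD IN CDB
```

So the host and path are fine (`H:0x0 P:0x0`); the device itself is rejecting a MODE SENSE(6) as an illegal request, which points at the storage controller/RAID layer rather than at vSAN.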


vCenter Server runs on the same two servers, and the vSAN Witness appliance is hosted on one of them.

**UPDATE:**

I checked the cluster with the RVC command `vsan.disks_stats` and see the results below:

    /172.16.0.10/Tehran-Datacenter/computers/Cluster-1> vsan.disks_stats .
    +--------------------------------------+-------------+-------+------+------------+---------+----------+---------+
    |                                      |             |       | Num  | Capacity   |         |          | Status  |
    | DisplayName                          | Host        | isSSD | Comp | Total      | Used    | Reserved | Health  |
    +--------------------------------------+-------------+-------+------+------------+---------+----------+---------+
    | naa.600304801cb841001f08f1ce0cfa04ce | 172.16.0.11 | SSD   | 0    | 276.43 GB  | 0.00 %  | 0.00 %   | OK (v3) |
    +--------------------------------------+-------------+-------+------+------------+---------+----------+---------+
    | naa.600304801cb8a3001f08ea0914333933 | 172.16.0.12 | SSD   | 0    | 276.43 GB  | 0.00 %  | 0.00 %   | OK (v3) |
    | naa.600304801cb8a3001f08ea8b1bef44fa | 172.16.0.12 | MD    | 56   | 1645.87 GB | 48.72 % | 4.75 %   | OK (v3) |
    +--------------------------------------+-------------+-------+------+------------+---------+----------+---------+

As you can see, the MD (magnetic disk) of my first server does not appear in this list, so I think this disk has left the vSAN cluster. How can I rejoin this disk to vSAN?

I checked the storage on the first server (`172.16.0.11`) with `esxcli vsan storage list` and see the results below:

    [root@esxi-1:/etc] esxcli vsan storage list
    naa.600304801cb841001f08f1ce0cfa04ce
       Device: naa.600304801cb841001f08f1ce0cfa04ce
       Display Name: naa.600304801cb841001f08f1ce0cfa04ce
       Is SSD: true
       VSAN UUID: 521ae5f3-eac3-cfa7-e10d-01b2f379762c
       VSAN Disk Group UUID: 521ae5f3-eac3-cfa7-e10d-01b2f379762c
       VSAN Disk Group Name: naa.600304801cb841001f08f1ce0cfa04ce
       Used by this host: true
       In CMMDS: true
       On-disk format version: 3
       Deduplication: false
       Compression: false
       Checksum: 5051104294654162127
       Checksum OK: true
       Is Capacity Tier: false
   
    naa.600304801cb841001f08f22c1296cd81
       Device: naa.600304801cb841001f08f22c1296cd81
       Display Name: naa.600304801cb841001f08f22c1296cd81
       Is SSD: false
       VSAN UUID: 52f0ac26-c7b0-8f0f-6dbb-3aeddcae32f2
       VSAN Disk Group UUID: 521ae5f3-eac3-cfa7-e10d-01b2f379762c
       VSAN Disk Group Name: naa.600304801cb841001f08f1ce0cfa04ce
       Used by this host: true
       In CMMDS: false
       On-disk format version: 3
       Deduplication: false
       Compression: false
       Checksum: 13462963856806851387
       Checksum OK: true
       Is Capacity Tier: true

As you can see, `In CMMDS` is false for the HDD, but I expected it to be true, the same as on the other server.
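
To spot this kind of mismatch quickly, the plain-text output of `esxcli vsan storage list` can be parsed into records and checked per device. A rough sketch; the field names are taken from the output above, but the parsing logic is a simplification for illustration, not an official API (the sample is a shortened copy of the real output):

```python
# Parse `esxcli vsan storage list`-style text into per-device dicts and
# flag devices that are claimed by the host but not announced in CMMDS.
sample = """\
naa.600304801cb841001f08f1ce0cfa04ce
   Device: naa.600304801cb841001f08f1ce0cfa04ce
   Is SSD: true
   In CMMDS: true
   Is Capacity Tier: false

naa.600304801cb841001f08f22c1296cd81
   Device: naa.600304801cb841001f08f22c1296cd81
   Is SSD: false
   In CMMDS: false
   Is Capacity Tier: true
"""

def parse(text):
    devices, current = [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):      # an unindented line starts a new device
            current = {"name": line.strip()}
            devices.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return devices

missing = [d["name"] for d in parse(sample) if d.get("In CMMDS") == "false"]
print(missing)   # ['naa.600304801cb841001f08f22c1296cd81']
```

Here the capacity-tier HDD is exactly the device that never made it back into CMMDS, matching the missing MD row in the `vsan.disks_stats` table.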

Do you think it is possible to see my virtual machines in `vsanStorage` again?

The data on my VMs is very important to me.


7 Replies
TheBobkin
VMware Employee

Hello Seyyed,

Welcome to Communities.

"I have two SuperMicro servers"

2+1 (e.g. 2 nodes + a Witness) or running everything as bootstrapped FTT=0?

"I installed `ESXi 6.0.0` on both of them, created a vSAN cluster with them, and placed all VMs on `vsanStorage`. Each server has two SSDs in RAID 1 and two HDDs in RAID 1"

    +--------------------------------------+-------------+-------+------+------------+---------+----------+---------+

    |                                      |             |       | Num  | Capacity   |         |          | Status  |

    | DisplayName                          | Host        | isSSD | Comp | Total      | Used    | Reserved | Health  |

    +--------------------------------------+-------------+-------+------+------------+---------+----------+---------+

    | naa.600304801cb841001f08f1ce0cfa04ce | 172.16.0.11 | SSD   | 0    | 276.43 GB  | 0.00 %  | 0.00 %   | OK (v3) |

    +--------------------------------------+-------------+-------+------+------------+---------+----------+---------+

    | naa.600304801cb8a3001f08ea0914333933 | 172.16.0.12 | SSD   | 0    | 276.43 GB  | 0.00 %  | 0.00 %   | OK (v3) |

    | naa.600304801cb8a3001f08ea8b1bef44fa | 172.16.0.12 | MD    | 56   | 1645.87 GB | 48.72 % | 4.75 %   | OK (v3) |

    +--------------------------------------+-------------+-------+------+------------+---------+----------+---------+

Do/did you have the SSDs and HDDs RAID1 before being presented as disks consumed by vSAN? As you said 2 cache + 2 capacity per node and I only see one of each above.

- only RAID0 (individual R0 VG per disk) and passthrough are supported and designed to work on vSAN.

This Disk-Group is likely kaput - can't rebuild metadata of disk-group = no data (and thus why it doesn't publish the MDs):

    2018-07-15T21:57:26.184Z cpu10:33562)PLOG: PLOGHandleLogEntry:320: Recovering SSD state for MD 52f0ac26-c7b0-8f0f-6dbb-3aeddcae32f2

    2018-07-15T21:58:39.226Z cpu0:33525)WARNING: LSOMCommon: SSDLOG_EnumLogCB:1450: SSD corruption detected. device: naa.600304801cb841001f08f1ce0cfa04ce:2

    2018-07-15T21:58:39.226Z cpu10:33562)WARNING: PLOG: PLOGEnumLogCB:411: Log enum CB failed with Corrupt RedoLog

    2018-07-15T21:58:39.226Z cpu10:33562)LSOMCommon: SSDLOG_EnumLogHelper:1401: Throttled: Waiting for 1 outstanding reads

    2018-07-15T21:58:39.226Z cpu0:33525)LSOMCommon: SSDLOG_IsValidLogBlk:132: Invalid version device: naa.600304801cb841001f08f1ce0cfa04ce:2

    2018-07-15T21:58:39.226Z cpu0:33525)WARNING: LSOMCommon: SSDLOG_EnumLogCB:1450: SSD corruption detected. device: naa.600304801cb841001f08f1ce0cfa04ce:2

    2018-07-15T21:58:39.337Z cpu7:33578)Destroyed VSAN Slab PLOGRecovSlab_0x4305644f1c70 (maxCount=32769 failCount=0)

    2018-07-15T21:58:39.337Z cpu22:33742)PLOG: PLOGRecDisp:823: PLOG recovery complete 521ae5f3-eac3-cfa7-e10d-01b2f379762c:Processed 2271342 entries, Took 73154 ms

    2018-07-15T21:58:39.337Z cpu22:33742)PLOG: PLOGRecDisp:832: Recovery for naa.600304801cb841001f08f1ce0cfa04ce:2 completed with Corrupt RedoLog

    2018-07-15T21:58:39.337Z cpu37:33315)WARNING: PLOG: PLOGCheckRecoveryStatusForOneDevice:6702: Recovery failed for disk 521ae5f3-eac3-cfa7-e10d-01b2f379762c

Potentially one of the SSDs in the RAID1 of 172.16.0.11 is having issues or just the RAID1 is unreliable and neither of the drives have an intact dataset. Could *potentially* try exposing just one (and/or the other) of these devices directly instead of as a RAID1 but this is untested and massively YMMV.

Bob

soroshsabz
Contributor

Thanks for the response.

Yes, I have 2 nodes, and a witness that runs on one of my nodes.

Yes, I have hardware RAID level 1 with the SSDs and HDDs; that is why you see only one SSD and one HDD per node.

How can I reconnect the MD device on 172.16.0.11 to vSAN again?

Why could PLOG not recover my device?

Why could the SSD not initialize on 172.16.0.11?

How can I recover the vmdk files from 172.16.0.12?

My vSAN storage policy is RAID 1 too.

thanks a lot

TheBobkin
VMware Employee

Hello Seyyed,

"Yes, I have 2 nodes, and a witness that runs on one of my nodes."

Running the Witness on one of the nodes is also a very bad idea, as you are essentially nesting the Fault Domains: failure of the node (or, here, the Disk-Group) that is backing the Witness results in 2/3 nodes being lost, not 1/3, so your FTT=1 Objects are of course inaccessible.

Run this on a node and let's see how much of your data is state 13 (data-component present) vs state 12:

    # cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c
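
The shell pipeline above simply tallies the `state` values inside the CONFIG_STATUS entries. The same count can be sketched in Python over saved `cmmds-tool find -f python` output; the `dump` snippet below is a made-up illustration of the escaped `\"state\": N` pattern, not real cluster output:

```python
import re
from collections import Counter

# Hypothetical excerpt of `cmmds-tool find -f python` output saved to a string;
# only the escaped "state\": N fields matter for the tally.
dump = r"""
"content": "{\"state\": 15, ...}"
"content": "{\"state\": 28, ...}"
"content": "{\"state\": 28, ...}"
"""

# Match `state` followed by an optional escaping backslash, a quote, and the value.
counts = Counter(int(s) for s in re.findall(r'state\\?":\s*(\d+)', dump))
for state, n in sorted(counts.items()):
    print(f"{n:>4} state: {state}")
```

This reproduces the `sort | uniq -c` shape of the grep pipeline, grouping the cluster's Objects by their CONFIG_STATUS state.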

"Yes, I have hardware RAID level 1 with SSDs and HDDs. for that matter you can see one SSD and one HDD per node."

Why did you configure it like this? You would have been much better off providing reliability just at the vSAN level using SPBM instead of introducing this additional (unsupported) level of failure. I can understand that maybe you thought it added less SPOF but in reality it just makes it less stable and likely more probable to failure.

"How can I reconnect the MD device on 172.16.0.11 to vSAN again?"

"Why could PLOG not recover my device?"

"Why could the SSD not initialize on 172.16.0.11?"

Well, potentially your 'hardware RAID' isn't so good at tracking what it wrote to which disk as it goes down, and doesn't have a clue which mirror is current (no quorum here!) - this is why I was saying you could *potentially* try exposing one or the other of the disks alone as the cache, but I have no idea if it will be seen as the same naa/UUID as the SSD that was the cache-tier of this Disk-Group. This is of course potentially impossible and/or won't work, but so is RAID1 exposed to vSAN (and thus why it shouldn't have been like this in the first place).

"How can I recover the vmdk files from 172.16.0.12?"

See what the state of the data is - if there are healthy, full R0 sets of data-components then *potentially* you could force FTT=0 outside of SPBM (but obviously at your own risk, with no guarantees, and this is never advisable - restoring from back-ups is simpler and less of a potential headache).

Bob

soroshsabz
Contributor

Thanks a lot for your response

I ran your command on both ESXi servers and see the results below:

172.16.0.11:

     12 state\": 15

     44 state\": 28

172.16.0.12:

     12 state\": 15

     44 state\": 28

How can I determine the FTT? And how can I force it to another value?

Why can't I see any of my VMs in /vmfs/vsanDatastore?

Why can't I restore my vmdks, when in the vSphere Web Client under Cluster -> Monitor -> Virtual SAN -> Virtual Disks I can see RAID 1 for all VMs, where the component on 172.16.0.12 is active/stale for all of them, the other component is absent and says "object not found", and the witness is active?

Is there any way to resolve this error?

    WARNING: PLOG: PLOGNotifyDisks:4036: Recovery on SSD 521ae5f3-eac3-cfa7-e10d-01b2f379762c had failed earlier, SSD not published

I think all of my HDDs and SSDs are physically healthy, because I can read from and write to them, but vSAN does not work.

best regards

TheBobkin
VMware Employee

Hello Seyyed,

"    12 state\": 15   "

These 12 Objects should be accessible; use objtool getAttr to determine what they are/were.

"   44 state\": 28    "

These 44 Objects are inaccessible and are stale/degraded and/or don't have quorum - have a look at one using cmmds-tool to further determine this (e.g. a single available stale data-component *might* be viable, but probably not).

"Why can't I see any of my VMs in /vmfs/vsanDatastore?"

Is the vsanDatastore accessible and of the expected size? (e.g. half the original size, as half your DGs are gone)

Potentially the namespace Objects are all inaccessible - do you see anything in vsanDatastore?

If all the VM namespaces are inaccessible but any viable vmdk Objects are accessible (the state 15 Objects), there are ways to attach them to a VM using placeholder disks (though to do this correctly one needs to know the exact size of the original disk, do the calculations, and modify various aspects of the placeholder .vmdk descriptor to point it to the Object, etc.). For anyone reading this - unsupported, do so at your own risk (and if you further break things by attempting it and only *then* call support, I will be very sad).
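
The placeholder-disk trick described above boils down to writing a vmdk descriptor whose extent line points at the surviving vSAN object and whose size in 512-byte sectors matches the original disk exactly. A very rough, unsupported sketch of generating such a descriptor; the object UUID and disk size below are hypothetical, the `vsan://` extent syntax is the form commonly seen in vSAN descriptor files, and a real descriptor carries additional fields (CID, ddb entries) omitted here:

```python
def placeholder_descriptor(object_uuid: str, size_bytes: int) -> str:
    """Build a minimal vmdk descriptor pointing at an existing vSAN object.

    size_bytes must be the EXACT size of the original virtual disk;
    vmdk extents are expressed in 512-byte sectors.
    """
    if size_bytes % 512:
        raise ValueError("disk size must be a multiple of 512 bytes")
    sectors = size_bytes // 512
    return "\n".join([
        "# Disk DescriptorFile",
        "version=1",
        'createType="vmfs"',
        "",
        "# Extent description",
        f'RW {sectors} VMFS "vsan://{object_uuid}"',
    ])

# Hypothetical 40 GiB disk backed by a surviving object UUID:
desc = placeholder_descriptor("52aaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
                              40 * 1024**3)
print(desc.splitlines()[-1])
# RW 83886080 VMFS "vsan://52aaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
```

The sector arithmetic is the part people get wrong: a size mismatch between the descriptor and the object makes the guest see a corrupt disk, which is why Bob stresses knowing the exact original size.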

"Is there any way to resolve the error `WARNING: PLOG: PLOGNotifyDisks:4036: Recovery on SSD 521ae5f3-eac3-cfa7-e10d-01b2f379762c had failed earlier, SSD not published`?"

This is basically informing you that you can't access the data on the Disk-Group because the PLOG data stored on the cache-tier is not viable - this is essentially metadata that describes and maps to the data on the capacity-tier, and if it is corrupt then there is essentially no usable Disk-Group. If you hadn't configured this as RAID1 before exposing it to vSAN, this would likely have been orders of magnitude less likely to occur - as I was saying previously: a last-ditch effort might be to try exposing either/both (but one at a time..) of the original SSDs as a passthrough disk, but the UUID/naa references would almost surely be borked (check what is seen).

"I think all of my HDDs and SSDs is physically healthy, because I can read and write to them, but vsan does not work."

Read and write to them how, using what method? Writing to an SSD that contains vSAN data outside of vSAN is likely only a way of corrupting it further.

Bob

soroshsabz
Contributor

Hi again,

Thanks a lot again for your response,

"Is the vsanDatastore accessible and of size?"

I see 3 VM folders in vsanDatastore and all of them are accessible, but I expected many more VMs in this folder.

I deleted `vsanStorage` from 172.16.0.11 and then recreated it. After this, I re-ran your command and see:

172.16.0.11:

     44 state\": 28

     13 state\": 7

172.16.0.12:

     44 state\": 28

     13 state\": 7

I think 8 objects are permanently lost :( :( :( but I see 13 objects in state 7, and for many of my VM virtual hard disks the health view shows two RAID 1 components, all of them healthy, like below:

[attached screenshot: pastedImage_6.png]

Now I have another question: how can I recover these hard disks, and recover my data, at least for the 13 objects in state 7?

thanks a lot

soroshsabz
Contributor

Hi again,

Can you help me recover my disk components?

Please help me

thanks a lot
