VMware Cloud Community
aNinjaneer
Contributor
Contributor

Disk group creation error: stuck "doing plog recovery on ssd"

I'm setting up a new 4-node all-flash configuration for our next reference architecture, and have hit a bump. Typically we only test two disk groups, but for this RA, we are doing four disk groups. I can create two disk groups on all hosts just fine, but as as soon as I go to create the third disk group on any of the hosts, it hangs. When I check the vmkernel log, the last entry is:

"2018-09-14T20:47:17.609Z cpu0:2108062)PLOG: PLOG_Recover:904: Doing plog recovery on SSD 5264f31a-41ae-f753-5cb0-f73c38982e3c"

In the gui, the "Create disk group on vSAN" task is stuck at 20% and it hangs until it fails. In the past, I've been able to setup clusters with 4+ DGs with no issues, so I don't know if this is a bug in new code, or if something else is going on. These are new SSDs, and I have verified that this is repeatable every time I try to create a third disk group on any of the four hosts. I even tried increasing the /LSOM/heapSize to 2047 just to be sure it wasn't an issue with that, since there are 4 1TB SSDs for the cache drives, but that didn't change anything. Has anyone else experienced this?

Here's the end of the vmkernel.log:

2018-09-14T20:47:12.709Z cpu14:2102771)LSOMCommon: LSOM_DiskGroupCreate:1512: Creating diskgroup uuid: 5264f31a-41ae-f753-5cb0-f73c38982e3c (Read cache size: 315948863488, Write buffer size: 644245086208)

2018-09-14T20:47:12.728Z cpu14:2102771)Created VSAN Slab PLOG_TaskSlab_DG_003 (objSize=1160 align=64 minObj=19861 maxObj=19861 overheadObj=794 minMemUsage=27540k maxMemUsage=27540k)

2018-09-14T20:47:12.738Z cpu14:2102771)Created VSAN Slab LSOM_TaskSlab_DG_003 (objSize=1904 align=64 minObj=25000 maxObj=25000 overheadObj=1000 minMemUsage=52000k maxMemUsage=52000k)

2018-09-14T20:47:12.739Z cpu14:2102771)Created VSAN Slab PLOG_RDTBuffer_DG_003 (objSize=280 align=64 minObj=19861 maxObj=19861 overheadObj=794 minMemUsage=6888k maxMemUsage=6888k)

2018-09-14T20:47:12.741Z cpu14:2102771)Created VSAN Slab PLOG_RDTSGArrayRef_DG_003 (objSize=40 align=64 minObj=19861 maxObj=19861 overheadObj=794 minMemUsage=1312k maxMemUsage=1312k)

2018-09-14T20:47:12.741Z cpu14:2102771)Created VSAN Slab SSDLOG_AllocMapSlab_DG_003 (objSize=12 align=64 minObj=0 maxObj=40960 overheadObj=0 minMemUsage=0k maxMemUsage=2604k)

2018-09-14T20:47:12.741Z cpu14:2102771)Created VSAN Slab SSDLOG_LogBlkDescSlab_DG_003 (objSize=4560 align=64 minObj=0 maxObj=40960 overheadObj=0 minMemUsage=0k maxMemUsage=327680k)

2018-09-14T20:47:12.741Z cpu14:2102771)Created VSAN Slab SSDLOG_CBContextSlab_DG_003 (objSize=80 align=64 minObj=0 maxObj=40960 overheadObj=0 minMemUsage=0k maxMemUsage=5288k)

2018-09-14T20:47:12.768Z cpu14:2102771)Created VSAN Slab BL_NodeSlab_DG_003 (objSize=304 align=64 minObj=351759 maxObj=351759 overheadObj=14070 minMemUsage=121944k maxMemUsage=121944k)

2018-09-14T20:47:13.526Z cpu14:2102771)Created VSAN Slab BL_CBSlab_DG_003 (objSize=10240 align=64 minObj=351759 maxObj=351759 overheadObj=14070 minMemUsage=4389948k maxMemUsage=4389948k)

2018-09-14T20:47:16.336Z cpu14:2102771)Created VSAN Slab BL_NodeKeysSlab_DG_003 (objSize=40960 align=64 minObj=328708 maxObj=328708 overheadObj=13148 minMemUsage=15041664k maxMemUsage=15041664k)

2018-09-14T20:47:16.336Z cpu14:2102771)Created VSAN Slab LSOM_SGASlab_DG_003 (objSize=400 align=64 minObj=0 maxObj=4096 overheadObj=0 minMemUsage=0k maxMemUsage=1824k)

2018-09-14T20:47:16.408Z cpu14:2102771)LSOMCommon: LSOM_GetBlkAttrCacheNumBytes:171: RC=301312MB cacheTierSize=915712MB allFlash=1 min=1120MB max=1568MB RAM=391873MB percent=0.4% maxAllowed=1568MB maxCacheSize=22400MB ConfigOptionCacheSize=0MB

2018-09-14T20:47:16.546Z cpu14:2102771)Allocated VSANBuffer 'bacache-5287564809223' of size 1644167168.

2018-09-14T20:47:16.551Z cpu14:2102771)Allocated VSANBuffer 'bacachemd-5287702665534' of size 51740672.

2018-09-14T20:47:16.565Z cpu14:2102771)LSOMCommon: SSDLOG_AddDisk:883: New SSD 5264f31a-41ae-f753-5cb0-f73c38982e3c name: naa.500a07511e066987:2

2018-09-14T20:47:16.580Z cpu14:2102771)LSOMCommon: IORETRY_Create:2585: An IORETRY queue for diskUUID 5264f31a-41ae-f753-5cb0-f73c38982e3c (0x432073c332e0) is NOT encrypted

2018-09-14T20:47:16.580Z cpu14:2102771)LSOMCommon: IORETRY_Create:2624: Queue Depth for device 5264f31a-41ae-f753-5cb0-f73c38982e3c set to 28

2018-09-14T20:47:16.580Z cpu14:2102771)ScsiEvents: 300: EventSubsystem: Device Events, Event Mask: 20, Parameter: 0x432073c332e0, Registered!

2018-09-14T20:47:16.580Z cpu14:2102771)Created VSAN Slab SSDIORetry_slab_0000000000 (objSize=272 align=64 minObj=2500 maxObj=25000 overheadObj=0 minMemUsage=836k maxMemUsage=8336k)

2018-09-14T20:47:16.581Z cpu14:2102771)Created VSAN Slab SSDIORetry_slab_0000000001 (objSize=272 align=64 minObj=2500 maxObj=25000 overheadObj=0 minMemUsage=836k maxMemUsage=8336k)

2018-09-14T20:47:16.581Z cpu14:2102771)Created VSAN Slab SSDIORetry_slab_0000000002 (objSize=272 align=64 minObj=2500 maxObj=25000 overheadObj=0 minMemUsage=836k maxMemUsage=8336k)

2018-09-14T20:47:16.582Z cpu14:2102771)Created VSAN Slab SSDIORetry_slab_0000000003 (objSize=272 align=64 minObj=2500 maxObj=25000 overheadObj=0 minMemUsage=836k maxMemUsage=8336k)

2018-09-14T20:47:16.582Z cpu14:2102771)Created VSAN Slab SSDIORetry_slab_0000000004 (objSize=272 align=64 minObj=2500 maxObj=25000 overheadObj=0 minMemUsage=836k maxMemUsage=8336k)

2018-09-14T20:47:16.582Z cpu14:2102771)LSOMCommon: SSDLOGInitDescForIO:791: Recovering ssdlog. It might take a while... device: naa.500a07511e066987:2

2018-09-14T20:47:16.582Z cpu14:2102771)LSOMCommon: SSDLOG_IsValidCP:271: Invalid Checkpoint mismatch in SSDPageSize expected : 4096 got 0 device: naa.500a07511e066987:2

2018-09-14T20:47:16.582Z cpu14:2102771)LSOMCommon: SSDLOG_IsValidCP:271: Invalid Checkpoint mismatch in SSDPageSize expected : 4096 got 0 device: naa.500a07511e066987:2

2018-09-14T20:47:16.582Z cpu14:2102771)LSOMCommon: SSDLOG_Recover:341: Both checkpoints are invalid.. Disk needs to be initialized device: naa.500a07511e066987:2

2018-09-14T20:47:16.582Z cpu14:2102771)LSOMCommon: SSDLOGInitDescForIO:794: SSD is not initialized, initializing... device: naa.500a07511e066987:2

2018-09-14T20:47:16.582Z cpu14:2102771)LSOMCommon: SSDLOG_Recover:396: Disabling read cache device: naa.500a07511e066987:2

2018-09-14T20:47:16.591Z cpu14:2102771)LSOMCommon: SSDLOGInitDescForIO:803: Recovery complete with Success... device: naa.500a07511e066987:2

2018-09-14T20:47:16.592Z cpu14:2102771)LSOMCommon: LSOM_RegisterDiskAttrHandle:127: naa.500a07511e066987:2 is a SATA disk

2018-09-14T20:47:16.592Z cpu14:2102771)LSOMCommon: LSOM_RegisterDiskAttrHandle:131: DiskAttrHandle:0x432073c30328 is added to disk:naa.500a07511e066987:2 by module:lsomcommon

2018-09-14T20:47:16.592Z cpu14:2102771)PLOG: PLOG_DeviceCreateSSDLogHandle:1096: Registered APD callback for 5264f31a-41ae-f753-5cb0-f73c38982e3c

2018-09-14T20:47:16.602Z cpu14:2102771)LSOMCommon: IORETRY_Create:2585: An IORETRY queue for diskUUID 5264f31a-41ae-f753-5cb0-f73c38982e3c (0x432073c450f0) is NOT encrypted

2018-09-14T20:47:16.602Z cpu14:2102771)LSOMCommon: IORETRY_Create:2624: Queue Depth for device 5264f31a-41ae-f753-5cb0-f73c38982e3c set to 28

2018-09-14T20:47:16.602Z cpu14:2102771)ScsiEvents: 300: EventSubsystem: Device Events, Event Mask: 20, Parameter: 0x432073c450f0, Registered!

2018-09-14T20:47:16.603Z cpu14:2102771)Created VSAN Slab PLOGIORetry_slab_0000000000 (objSize=272 align=64 minObj=2500 maxObj=25000 overheadObj=0 minMemUsage=836k maxMemUsage=8336k)

2018-09-14T20:47:16.603Z cpu14:2102771)Created VSAN Slab PLOGIORetry_slab_0000000001 (objSize=272 align=64 minObj=2500 maxObj=25000 overheadObj=0 minMemUsage=836k maxMemUsage=8336k)

2018-09-14T20:47:17.603Z cpu14:2102771)PLOG: PLOGAnnounceSSD:7253: Trace task started for device 5264f31a-41ae-f753-5cb0-f73c38982e3c

2018-09-14T20:47:17.603Z cpu14:2102771)PLOG: PLOGAnnounceSSD:7268: Successfully added VSAN SSD (naa.500a07511e066987:2) with UUID 5264f31a-41ae-f753-5cb0-f73c38982e3c. kt 1, en 0, enC 0.

2018-09-14T20:47:17.603Z cpu14:2102771)VSAN: Initializing SSD: 5264f31a-41ae-f753-5cb0-f73c38982e3c Please wait...

2018-09-14T20:47:17.603Z cpu14:2102771)VSAN: Initializing SSD: 52d963e3-aad5-b95e-9457-8ecac9edbc51 Please wait...

2018-09-14T20:47:17.603Z cpu0:2108062)PLOG: PLOGNotifyDisks:4546: MD 0 with UUID 52408ea8-f0f9-b3f3-78e8-50e3ba9af213 with state 0 formatVersion 6 backing SSD 5264f31a-41ae-f753-5cb0-f73c38982e3c notified

2018-09-14T20:47:17.603Z cpu14:2102771)VSAN: Initializing SSD: 528b2ea8-2a6d-f7db-af27-d15e743c3d49 Please wait...

2018-09-14T20:47:17.603Z cpu0:2108062)PLOG: PLOGNotifyDisks:4546: MD 1 with UUID 52e7f360-4863-8751-31ab-63149b081f43 with state 0 formatVersion 6 backing SSD 5264f31a-41ae-f753-5cb0-f73c38982e3c notified

2018-09-14T20:47:17.603Z cpu36:2103962)PLOG: PLOGNotifyDisks:4546: MD 0 with UUID 52327091-0421-2696-df44-d2c891b937b2 with state 0 formatVersion 6 backing SSD 52d963e3-aad5-b95e-9457-8ecac9edbc51 notified

2018-09-14T20:47:17.603Z cpu36:2103962)PLOG: PLOGNotifyDisks:4546: MD 1 with UUID 52c5f412-1214-56ae-9767-e7ea47ba75e1 with state 0 formatVersion 6 backing SSD 52d963e3-aad5-b95e-9457-8ecac9edbc51 notified

2018-09-14T20:47:17.603Z cpu48:2105033)PLOG: PLOGNotifyDisks:4546: MD 0 with UUID 528f1a28-4a47-e344-e561-6844c53ee0f6 with state 0 formatVersion 6 backing SSD 528b2ea8-2a6d-f7db-af27-d15e743c3d49 notified

2018-09-14T20:47:17.603Z cpu48:2105033)PLOG: PLOGNotifyDisks:4546: MD 1 with UUID 524c0c8c-0eb1-eff8-a41d-bc216d0f4942 with state 0 formatVersion 6 backing SSD 528b2ea8-2a6d-f7db-af27-d15e743c3d49 notified

2018-09-14T20:47:17.603Z cpu36:2103962)PLOG: PLOG_Recover:884: !!!! SSD 52d963e3-aad5-b95e-9457-8ecac9edbc51 already recovered

2018-09-14T20:47:17.603Z cpu48:2105033)PLOG: PLOG_Recover:884: !!!! SSD 528b2ea8-2a6d-f7db-af27-d15e743c3d49 already recovered

2018-09-14T20:47:17.603Z cpu0:2108062)VSANServer: VSANServer_InstantiateServer:3380: Instantiated VSANServer 0x4309ad6c3058

2018-09-14T20:47:17.605Z cpu58:2100920)Created VSAN Slab RcSsdParentsSlab_0x432073c49aa0 (objSize=208 align=64 minObj=2500 maxObj=25000 overheadObj=0 minMemUsage=668k maxMemUsage=6668k)

2018-09-14T20:47:17.606Z cpu58:2100920)Created VSAN Slab RcSsdIoSlab_0x432073c49aa0 (objSize=65536 align=64 minObj=64 maxObj=25000 overheadObj=0 minMemUsage=4352k maxMemUsage=1700000k)

2018-09-14T20:47:17.606Z cpu58:2100920)Created VSAN Slab RcSsdMdBElemSlab_0x432073c49aa0 (objSize=32 align=64 minObj=4 maxObj=4096 overheadObj=0 minMemUsage=4k maxMemUsage=264k)

2018-09-14T20:47:17.606Z cpu58:2100920)Created VSAN Slab RCInvBmapSlab_0x432073c49aa0 (objSize=56 align=64 minObj=1 maxObj=1 overheadObj=189 minMemUsage=16k maxMemUsage=16k)

2018-09-14T20:47:17.606Z cpu22:2108139)Global: Virsto_CreateInstance:163: INFO: Create new Virsto instance (heapName: virstoInstance_00000005)

2018-09-14T20:47:17.608Z cpu22:2108139)Global: Virsto_CreateInstance:163: INFO: Create new Virsto instance (heapName: virstoInstance_00000006)

2018-09-14T20:47:17.609Z cpu22:2108139)DOM: DOMDisk_GetServer:259: disk-group w/ SSD 5264f31a-41ae-f753-5cb0-f73c38982e3c on dom/comp server 2

2018-09-14T20:47:17.609Z cpu22:2108139)LSOM: LSOMSendDiskStatusEvent:5424: Throttled: Unable to post disk status event for disk 5264f31a-41ae-f753-5cb0-f73c38982e3c: Not found

2018-09-14T20:47:17.609Z cpu22:2108139)LSOM: LSOMSSDEnumCb:227: Finished reading SSD Log for disk 5264f31a-41ae-f753-5cb0-f73c38982e3c: Success

2018-09-14T20:47:17.609Z cpu22:2108139)LSOM: LSOMRecoveryDispatch:2567: LLOG recovery complete 5264f31a-41ae-f753-5cb0-f73c38982e3c:Processed 0 entries, Recovered 0 entries, Took 0 ms. slack 0.

2018-09-14T20:47:17.609Z cpu22:2108139)LSOM: LSOMLogDiskEvent:5618: Disk Event arrival for SSD 5264f31a-41ae-f753-5cb0-f73c38982e3c (naa.500a07511e066987:2)

2018-09-14T20:47:17.609Z cpu0:2108062)PLOG: PLOG_Recover:904: Doing plog recovery on SSD 5264f31a-41ae-f753-5cb0-f73c38982e3c

Any help is appreciated. I've attached a few log files, and can provide more if needed.

Thanks

0 Kudos
3 Replies
vpradeep01
VMware Employee
VMware Employee

Hello @aNinjaneer,

1. Could you please provide us the Controller details. Below are the commands:

  • esxcfg-scsidevs -a
  • vmkchdev -l | grep -i <vmhbaX>

Replace vmhbaX with the actual VMHBA which is in use for vSAN

2. How have you presented these drives to the IO Controller? Pass-through mode or RAID 0?

Could you please make sure all drives are presented as pass-though or RAID 0.

esxcli storage core device list | grep -i "Drive Type"

3. Have you checked as how many drives does the IO Controller which you are using supports?

4. SSD Drive and drives used in capacity tier part number or Model number please?

Thanks

0 Kudos
aNinjaneer
Contributor
Contributor

The servers are Dell R740xds with PERC H330s in pass-through/HBA mode. The drives are all Micron 5200s, 960GB MAX for cache and 4TB ECO for capacity. Like I said, we've done this configuration numerous times, including with same/similar hardware and drives, and never seen this issue. It seems like a problem on the ESXi side.

esxcfg-scsidevs -a

vmhba2  lsi_msgpt3        link-n/a  sas.5d0946601c009900                    (0000:18:00.0) Avago (LSI Logic) Dell HBA330 Adapter

vmkchdev -l | grep -i vmhba2

0000:18:00.0 1000:0097 1028:1f45 vmkernel vmhba2

esxcli storage core device list | grep -i "Drive Type"

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

   Drive Type: physical

0 Kudos
vpradeep01
VMware Employee
VMware Employee

Hello,

I don't see any issues during the configuration. Would you mind submitting a ticket with all the log bundles please?

Thanks

0 Kudos