aNinjaneer
Contributor

Can't create more than three disk groups

I'm currently running vSAN 6.7 U2 on Dell R730xd servers in a two-node configuration with a witness appliance, using all NVMe. I'm unable to create a fourth disk group on either host. I've seen this issue before when trying to create four disk groups on a four-node cluster. With previous versions of ESXi I was able to create more disk groups, but I haven't been able to since the original 6.7 release build. Has anyone else seen this? I've tested it with both NVMe and SATA, with the same results. At first I thought it might be related to the PERC, as it has had timing issues with certain drives in the past, but NVMe obviously doesn't go through the PERC.

I've attached screenshots of what happens. Basically, the fourth disk group doesn't show a disk format version, and I get an Operation Health warning. The first three disk groups on each host create fine, but the fourth one always fails, as does a fifth if I try to create one.

8 Replies
aNinjaneer
Contributor

Furthermore, I'm unable to delete the stale disk group in the HTML5 client, and have to revert to the Flash client to remove it.

TheBobkin
Champion

Hello Collin,

What build number are the vCenter and hosts on? (e.g. 13006603)

Have you tried creation in the FLEX Client also?

Does this occur when you do it from the CLI? e.g.:

# esxcli vsan storage add -s ssd1 -d hdd1 -d hdd2 -d hdd3

The vmkernel.log from when you are performing the above may help.
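For example, you could watch it live while running the command above (a minimal sketch, assuming the default log location; the LSOM/PLOG filter just narrows the output to the vSAN disk-group layer):

# tail -f /var/log/vmkernel.log | grep -iE 'lsom|plog'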

Bob

aNinjaneer
Contributor

Hey Bob,

I'm on the latest GA build 13006603, though I had this same issue with multiple 6.7 and 6.7U1 builds.

I have tried with the HTML5 client, the Flex client, and the CLI, and all give the same results.

I've attached the vmkernel log.

TheBobkin
Champion

Hello Collin,

It is running out of memory (OOM):

2019-05-10T17:23:06.586Z cpu2:2103807 opID=9a864b99)Created VSAN Slab BL_NodeSlab_DG_093 (objSize=304 align=64 minObj=351085 maxObj=351085 overheadObj=14043 minMemUsage=121712k maxMemUsage=121712k)
2019-05-10T17:23:06.893Z cpu2:2103807 opID=9a864b99)WARNING: LSOMCommon: LSOMSlabCreateInt:701: Failed to create slab BL_CBSlab_DG_093 for size 10240 * 351085: Out of memory
2019-05-10T17:23:06.893Z cpu2:2103807 opID=9a864b99)WARNING: LSOMCommon: LSOMSlabCreate:771: Unable to create slab "BL_CBSlab_DG_093"
2019-05-10T17:23:06.893Z cpu2:2103807 opID=9a864b99)WARNING: LSOMCommon: LSOMSlabsInit:820: Unable to create slab "BL_CBSlab"
2019-05-10T17:23:06.893Z cpu2:2103807 opID=9a864b99)Destroyed VSAN Slab PLOG_TaskSlab_DG_093 (maxCount=0 failCount=0)
2019-05-10T17:23:06.894Z cpu2:2103807 opID=9a864b99)Destroyed VSAN Slab LSOM_TaskSlab_DG_093 (maxCount=0 failCount=0)
2019-05-10T17:23:06.894Z cpu2:2103807 opID=9a864b99)Destroyed VSAN Slab PLOG_RDTBuffer_DG_093 (maxCount=0 failCount=0)
2019-05-10T17:23:06.894Z cpu2:2103807 opID=9a864b99)Destroyed VSAN Slab PLOG_RDTSGArrayRef_DG_093 (maxCount=0 failCount=0)
2019-05-10T17:23:06.894Z cpu2:2103807 opID=9a864b99)Destroyed VSAN Slab SSDLOG_AllocMapSlab_DG_093 (maxCount=0 failCount=0)
2019-05-10T17:23:06.894Z cpu2:2103807 opID=9a864b99)Destroyed VSAN Slab SSDLOG_LogBlkDescSlab_DG_093 (maxCount=0 failCount=0)
2019-05-10T17:23:06.894Z cpu2:2103807 opID=9a864b99)Destroyed VSAN Slab SSDLOG_CBContextSlab_DG_093 (maxCount=0 failCount=0)
2019-05-10T17:23:06.894Z cpu2:2103807 opID=9a864b99)Destroyed VSAN Slab BL_NodeSlab_DG_093 (maxCount=0 failCount=0)
2019-05-10T17:23:06.894Z cpu2:2103807 opID=9a864b99)WARNING: LSOMCommon: LSOM_DiskGroupCreate:1617: Unable to create slabs for disk 52c23eee-d6e0-88db-77d6-baf03238faac
2019-05-10T17:23:06.910Z cpu2:2103807 opID=9a864b99)WARNING: PLOG: PLOGInitDiskGroupMemory:7282: Failed to initialize the memory for the diskgroup 52c23eee-d6e0-88db-77d6-baf03238faac: Success
2019-05-10T17:23:06.910Z cpu2:2103807 opID=9a864b99)WARNING: PLOG: PLOGAnnounceSSD:7392: Failed to initialize the memory for the diskgroup 52c23eee-d6e0-88db-77d6-baf03238faac : Success

How much physical memory do these servers have?

If they have less memory than the requirements call for, then it will need to be increased:

VMware Knowledge Base

If they are fairly standard servers with 384 GB or 512 GB, then you should start by increasing the memory assigned to LSOM:

VMware Knowledge Base
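For reference, a minimal sketch of checking and raising that setting from the host shell, assuming the advanced option named in that KB is /LSOM/heapSize with a 2047 MB maximum (a host reboot is needed for the change to take effect):

# esxcfg-advcfg -g /LSOM/heapSize

# esxcfg-advcfg -s 2047 /LSOM/heapSize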

Bob

TheBobkin
Champion

Also, why does it look like you are trying to use 15TB devices as the cache tier? Read the KB I linked above and do the calculation for required memory; this is likely implicated.

You won't benefit from anything above 600GB of cache other than the longevity of the device, so using one of these as the cache tier is not ideal; you would be better off with a smaller, faster device. (Also, I can't find the model you are using on the vSAN HCL; if it isn't there, then I hope this is a POC and not something you plan to put into production.)
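As a very rough illustration of the shape of that per-host calculation for four disk groups, each with one 600GB cache device and three capacity disks (the constant values here are placeholders from memory, not authoritative figures; use the exact formula and per-version numbers from the KB, and note that only the first 600GB of each cache device counts):

# echo "$(( 5426 + 4 * (636 + 8 * 600 + 3 * 70) )) MB"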

Bob

aNinjaneer
Contributor

I'm familiar with the calculation for that, but I don't think the issue has to do with memory or LSOM heap size. Each server has 384GB of memory, and the servers are only showing 68.9GB consumed after creating three disk groups. I already increased the LSOM heap size to 2047, and it did the same thing.

As for the drive, it's an engineering sample. It is carved out to present only 600GB, even though the underlying storage media has a lot more capacity. It's a newly announced drive (not yet released to the public), so it is not on the HCL yet; it is currently in the qualification process. We're just doing some exploratory science experiments with it at this point.

Like I said, I tried the same thing on some lower capacity SATA drives previously and saw the same thing, but was hoping the issue would be sorted with newer builds, so I just stuck with three disk groups for the last project. Now I'm actually trying to see how things can scale with NVMe, so I'd like to figure this out. It seems like a bug to me, so I was wondering if anyone else had run 4+ disk groups with 6.7 on an all-flash cluster, particularly on Dell hardware.

TheBobkin
Champion

"Like I said, I tried the same thing on some lower capacity SATA drives previously and saw the same thing, but was hoping the issue would be sorted with newer builds, so I just stuck with three disk groups for the last project. Now I'm actually trying to see how things can scale with NVMe, so I'd like to figure this out. It seems like a bug to me"

If you can reproduce it on hardware on the HCL then please do open a support request and we can engage engineering if necessary. While the symptoms are the same as when you encountered this previously, that doesn't necessarily mean it was the exact same cause (e.g. generic LSOM heap exhausted as opposed to slab creation failure due to resources).

"so I was wondering if anyone else had run 4+ disk groups with 6.7 on an all-flash cluster, particularly on Dell hardware."

I have seen all-flash clusters running just fine with 4+ disk groups on 6.7 builds, so it could be something specific to your setup.

Bob

aNinjaneer
Contributor

Because the support process is so long and drawn out, I figured I'd start here first. I have opened a support request, and can pop in some other (supported) drives in the meantime just to prepare the files for support.
