Can anybody explain the behavior of the LLOG to PLOG to magnetic disk destaging?
In normal operation or under heavy load tests I get an LLOG log of 2-10 GB of data per SSD.
But during resyncs I get situations where the LLOG log space keeps growing until it reaches the default LLOG congestion values of 16-24 GB, so all VMs stall.
I assume that this behavior occurs when resync is syncing not-yet-written blocks (thin space), for example in a reconfiguration (disk group evacuation) case.
I can handle this situation by increasing the LLOG limits to 48-64 GB, while decreasing the SSD resync congestion values to a percentage value below the LLOG limits.
I also noticed that in normal situations the value "log free delta" is below "relog threshold" (vsish -e cat /vmkModules/lsom/disks/52cc072f-0bd9-85f5-42e0-0d246411ddbc/info, read from the SSD disk),
but when the LLOG log keeps growing, I see that "log free delta" is larger than "relog threshold", and "relog count" and "relog iterations" get stuck.
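To watch these four values over time, a minimal sketch could filter the vsish output down to the relog-related lines. The helper name relog_fields is hypothetical, and it assumes the labels appear in the output spelled as quoted above (matched case-insensitively); the exact output format is not confirmed here.

```shell
# Keep only the relog-related lines from the disk info output read on stdin.
# Assumption: the labels appear in the output as quoted in this post.
relog_fields() {
    grep -i -E 'log free delta|relog threshold|relog count|relog iterations'
}
```

On a host this might be used as `vsish -e cat /vmkModules/lsom/disks/<ssd-uuid>/info | relog_fields`, repeated every few seconds to see whether "relog count" and "relog iterations" still advance.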
I have already read the new manual page.
I have also read the patent.
But for my own understanding I am looking for a more detailed technical explanation.
"When a data block arrives in the write buffer, a corresponding entry for it is kept in LLOG, the logical log. This is used by LSOM for log recovery on reboot."
How much LLOG log space vs. PLOG data space is used when a block arrives in the write buffer?
I want to work out how it is possible that the LLOG log space is more than 1000x larger than the PLOG data space.
"At this point, we no longer need to keep the LLOG entry."
It seems that this process does not run instantly. It seems to run like the elevator, once a threshold is reached, but I can't find any information on which threshold applies or how to trigger it manually in case of problems.
Sorry, no additional info is provided around this. These are implementation details which are not relevant to customers. If problems are experienced you can reach out to customer support and they can help you move forward. Thanks,
Thank you for the reply.
I contacted you two weeks ago via Facebook?
Customer support is currently analysing this (and has been for over 3 weeks)...
I think it's very important for me as a customer to know the implementation details.
That's because the only workaround for the endlessly growing LLOG currently is "Buy a 200 TB 3PAR and move all data; after 1 week of moving data you can resume operation on 3PAR..."
In the meantime I must change values via vsish to get the system up and running (without assistance from support), but these values are not described anywhere, so I know nothing and am working in a fog.
I see similar behaviour in our environment with vSAN 6.2 during maintenance mode with full data migration. Resync starts at a high bandwidth; this causes log congestion on the SSDs (maybe because the magnetic disks are under heavy load) and therefore higher write latency on the virtual machines (10 ms vs. 40 ms and higher). If I change lsomLogCongestionLowLimitGB and lsomLogCongestionHighLimitGB to higher values
esxcfg-advcfg -s 24 /LSOM/lsomLogCongestionLowLimitGB
esxcfg-advcfg -s 48 /LSOM/lsomLogCongestionHighLimitGB
the write latency on the virtual machines increases as well, because the resync runs at a higher rate. How do you change the SSD resync congestion values? I will try asking VMware support to limit the resync during maintenance mode with full data migration. Maybe this is possible.
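Before raising the two limits above, it can help to record their current values so they can be restored later. A minimal sketch, using the -g (get) option of esxcfg-advcfg; the function name show_llog_limits and the ADVCFG override variable are hypothetical and exist only so the loop can be dry-run away from a host.

```shell
# Command used to read advanced settings; overridable for a dry run off-host.
ADVCFG="${ADVCFG:-esxcfg-advcfg}"

# Print the current value of both LSOM log congestion limits named above.
show_llog_limits() {
    for opt in lsomLogCongestionLowLimitGB lsomLogCongestionHighLimitGB; do
        "$ADVCFG" -g "/LSOM/$opt"
    done
}
```

Calling `show_llog_limits` on each host before and after the change gives a record of what was set where.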
Is there any news on your issue from support?
Use at your own risk, but these were provided by VMware support when I needed to slow down resync because LLOG was increasing nonstop:
<UPDATE>Moderator edit: removed the advanced setting to prevent people from shooting themselves in the foot. These settings are only supposed to be used when instructed by support</UPDATE>
50 is the default value; lowering this number will slow down resync. Apply it to all hosts; no reboot is necessary.
I have also run into the issue of LLOG increasing, where LLOG could be something like 30-40 GB with PLOG still at 1-2 GB. This happened when upgrading my cluster from 6.0 build 4510822 to 4600944. I never got a good, clear answer from support. They said there was not enough cache capacity, but I called BS, since I had every VM on my cluster powered off and no resync activity and this was still occurring; it seemed like some kind of log leak. I am running R730xds with H730s, which have a problematic history, so it could be some problem with the RAID controller too, even though I'm running HCL firmware/drivers. I couldn't get support to assist me quickly enough, so eventually I had to destroy the affected disk group, as the node took more than 12 hours to LLOG recover (causing missing objects to resync) and the affected disk group still wouldn't mount after coming back online.
The errors during the LLOG leak looked something like this on the affected node:
2016-11-27T22:57:24.924Z cpu39:44452)WARNING: VSAN: VsanSparseWriteDone:3974: Throttled: Write error on '70674b1-0c4b3858-5421-3876-4a4b-ecf4bbd4b888': token status 'IO was aborted',SCSI status 8 (OK:BUSY)
2016-11-27T22:57:26.536Z cpu7:44388)HBX: 2802: '21c03057-1a09-4711-614c-ecf4bbd4d520': HB at offset 3457024 - Waiting for timed out HB:
2016-11-27T22:57:26.536Z cpu7:44388) [HB state abcdef02 offset 3457024 gen 7 stampUS 137850214982 uuid 58390ad8-a5c516bb-6f90-ecf4bbd2f6c8 jrnl <FB 70000> drv 14.61 lockImpl 4]
2016-11-27T22:57:30.672Z cpu28:60910)LSOM: LSOM_ThrowCongestionVOB:3459: Throttled: Virtual SAN node vsan-c00-n03 maximum Memory congestion reached.
2016-11-27T22:57:41.333Z cpu1:33092)VSCSI: 2993: Retry 0 on handle 8193 still in progress after 64 seconds
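Note that the congestion warning in this excerpt reports Memory congestion rather than log congestion. To see which congestion type a node reports most often, a rough sketch could tally the LSOM_ThrowCongestionVOB lines by type. The function name congestion_counts is hypothetical, and it assumes every such warning follows the "maximum <type> congestion reached" wording seen above.

```shell
# Count LSOM congestion warnings per congestion type from log text on stdin.
# Assumption: all warnings use the "maximum <type> congestion reached" wording.
congestion_counts() {
    grep 'LSOM_ThrowCongestionVOB' |
        sed -n 's/.* maximum \(.*\) congestion reached.*/\1/p' |
        sort | uniq -c
}
```

On the affected node this could be run as `congestion_counts < /var/log/vmkernel.log`.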