PatrickDLong's Posts

@MillardJK  I really appreciate your persistence in replying to these posts despite having already resolved the issue in your environment. I agree that the best long-term solution is to move to high-endurance boot devices, however... in a 200+ diskless host environment spread across two remote locations, the parts expense and man-hours required for that kind of mass overhaul of our vSphere host architecture is immense. I'm trying to find the best possible path forward with the existing boot device configuration, and your responses are giving me important data points as I chart this path. So thank you! And also thanks to @LucianoPatrão for his valuable blog post!

As an aside, I have been using a single shared-storage location to host my VMware Tools files ever since VMware Tools was decoupled from specific host version compatibility and announced as backward and forward compatible - I forget exactly when that was, but some time ago. I used to use a rather simplistic method of replacing the symbolic link to the productLocker location, i.e.

rm /productLocker
ln -s /vmfs/volumes/VOLUMENAMEHERE/SharedLocker /productLocker
ls -n

but deprecated commands in newer ESXi versions now have me doing this by pasting a more inelegant series of PowerShell commands:

$esxName = 'HOSTNAMEHERE'
$dsName = 'DATASTORENAMEHERE'
$dsFolder = 'SharedLocker'
$esx = Get-VMHost -Name $esxName
$ds = Get-Datastore -Name $dsName
$oldLocation = $esx.ExtensionData.QueryProductLockerLocation()
$location = "/$($ds.ExtensionData.Info.Url.TrimStart('ds:/'))$dsFolder"
$esx.ExtensionData.UpdateProductLockerLocation($location)
Write-Host "Tools repository moved from"
Write-Host $oldLocation
Write-Host "to"
Write-Host $location

I'm sure there's a cleaner way of doing this, but it works for me and I haven't had time to polish it. What I haven't been able to determine is whether this strategy completely removes any boot device I/O attributable to the host's VMware Tools location, with that I/O instead redirected to the shared storage location specified in my productLocker symbolic link - IMO this *should* effectively reduce my risk of I/O-induced boot device issues, along with other mitigations like redirecting scratch.
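As a sanity check after the redirect, this is roughly what I look at from the ESXi shell - a minimal sketch, assuming the UserVars.ProductLockerLocation advanced option is present on your build (the paths above are placeholders):

# Where the host thinks its Tools repository lives (assumption: option exists on this build)
esxcli system settings advanced list -o /UserVars/ProductLockerLocation
# The symlink should now point at the shared datastore folder
ls -l /productLocker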
OK @MillardJK, so it is slightly concerning that you were already doing everything "right", with the exception of the VMware Tools redirect to ramdisk, and STILL suffered corruption of your SD boot media. I mean, how much I/O could VMware Tools POSSIBLY generate, especially when you are not actively upgrading Tools on your VMs? This is why I'm suspicious that VMware is not being 100% forthright about the amount (and sources) of I/O going to the boot device in 7.x.
Probably not of much value to you at this point since you swapped to SATADOM, but I have read that the Dell IDSDM dual-SD module has issues under 7.x just like any other USB-based boot device. My current environment is 100% HPE (all diskless), but my previous environment was 100% Dell and many hosts used this mirrored SD solution. I would like to know: had you redirected scratch to a location backed by high-endurance media, like your SAN? Or was scratch going to the IDSDM SD cards via the OSDATA VMFS-L partition? I'm seeing a lot of evidence of USB and SD media being damaged by too high a level of I/O to the device, which was likely your case.

Another option to lower the I/O to the boot device is to enable /UserVars/ToolsRamdisk, which creates a ramdisk on boot and serves the host's VMware Tools from RAM rather than from the boot device - see https://kb.vmware.com/s/article/2149257 . I just wonder whether you had performed either or both of those remediations and STILL had the failures - I think 7.x is generating more I/O to the boot device than VMware is letting on...
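For reference, a minimal sketch of enabling that option from the ESXi shell, per the KB above (a reboot is needed before the ramdisk is created):

# Serve VMware Tools from a RAM disk instead of the boot device
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1
# Verify the option took (Int Value should show 1)
esxcli system settings advanced list -o /UserVars/ToolsRamdisk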
@continuum This issue turned out to be related to https://kb.vmware.com/s/article/81434 "Slow storage device discovery causes bootbank/scratch to not get mounted". Even though both of my bootbanks WERE loading (USB-based micro-SD boot device on the motherboard), it turns out I needed to introduce a delay into the storage-path-claim service discovery process due to the large number of SAN volumes attached to these hosts. On many hosts, the volume where I had configured the scratch location was not being enumerated/discovered in the default amount of time, so I added devListStabilityCount=30 to the kernelopt= line in boot.cfg in both bootbank and altbootbank and voila - the scratch partition was discovered appropriately and used after every subsequent reboot. Additional background info is available in https://kb.vmware.com/s/article/2149444 "Bootbank loads in /tmp/ after reboot of ESXi 7.0 Update 1 host".
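For anyone else chasing this, a sketch of what the change looks like (keep whatever options are already on the kernelopt line; the value 30 is just what worked for my SAN volume count):

# In /bootbank/boot.cfg and /altbootbank/boot.cfg, append to the existing kernelopt line:
kernelopt=<existing options> devListStabilityCount=30
# After the next reboot, /scratch should resolve to the configured location rather than a /tmp ramdisk
ls -ld /scratch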
@BohdanKotelyak  Just a hunch, but does this host boot from a USB or SD/microSD device? SSH to the host and see if you can list the filesystem with 'ls -n'. Does the command fail to complete? Does anything show up in red? If yes, then issue 'cat /var/log/vmkernel.log' and look for entries like 'vmhba32 ... timed out' or 'status in doubt'.
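A sketch of the kind of quick check I mean (vmhba32 is usually the USB/SD adapter, but confirm which vmhba your boot device actually sits on):

# A hang here is a strong hint the boot device has dropped off
ls -n /
# Pull recent timeout / "status in doubt" entries from the vmkernel log
grep -iE "vmhba32|status in doubt|timed out" /var/log/vmkernel.log | tail -20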
@batkite972  True - I've spent the last decade at two different employers getting rid of spinning rust, which was my #1 failure point in both ESXi hosts and SAN arrays, and using on-board SD or micro-SD cards as boot media. Now VMware changes tack and "prefers" high-endurance boot media. "Just add a pair of redundant SSDs or M.2 devices to your hosts," they say. Ha! Who's paying for THAT in a 200+ host environment? Maybe I'll just deduct that cost from the next licensing renewal quote I get before paying it ;-)

There is absolutely no reason that ESXi boot devices shouldn't continue to be used in the manner they always have been - read once at boot time or when backing up the config, and written to only for hypervisor patches, upgrades, and config changes/restores. All other I/O should be redirectable to more high-endurance media, like local storage if you have it or SAN arrays.
@habass please see my lengthy reply in this thread:  https://communities.vmware.com/t5/ESXi-Discussions/ESXI-7-0-2-Host-Hanging-after-upgrade-from-6-5-to-7/m-p/2851891/highlight/true#M276493
@MillardJK  I'm interested to hear whether leaving the USB Arbitrator service stopped continued to be effective for you over time. Many of the references I've seen blame this issue exclusively on the vmkusb driver, but your post seems to indicate that may not be the case - or perhaps it is the interaction of the vmkusb driver with the USB Arbitrator service that is causing the issue. Have you considered permanently disabling the USB Arbitrator service so it stays off after reboots?

~ # chkconfig usbarbitrator off
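For completeness, the full sequence I have in mind - stop it now, keep it off across reboots, and confirm the state - assuming you don't need USB passthrough to VMs:

# Stop the USB arbitrator immediately, then keep it disabled after reboots
/etc/init.d/usbarbitrator stop
chkconfig usbarbitrator off
# Confirm the service is no longer running
/etc/init.d/usbarbitrator status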
Welcome to the "ESXi 7.0 U2a REALLY does NOT like your SD card boot media" party - this issue is 100% because ESXi has disconnected the filesystem of your SC card media, as evidenced by your vmkernel... See more...
Welcome to the "ESXi 7.0 U2a REALLY does NOT like your SD card boot media" party - this issue is 100% because ESXi has disconnected the filesystem of your SC card media, as evidenced by your vmkernel.log entries indicating timeouts to device mpx.vmhba32:C0:T0:L0  Let's call this Issue #1 - Read on! This is a SEPARATE issue from Issue #2 with ESXi 7.x and  USB-based (which is classified as 'low-endurance' ) devices - including SD-Card, which is hanging off your motherboard's USB hub - where in the course of normal operations ESXi 7 is corrupting that low-endurance boot media due to higher-than-previous-releases I/O to those devices - in fact ESXi 7 removed throttling of I/O to those devices altogether. See Issue #2 References below. YOUR Issue #1 is different, however - but it is likely caused by the vmkusb driver ( 0.1-1vmw.702.0.0.17867351) included in the custom ESXi image (and in fact ALL released 7.0 U2a images from any vendor including the vanilla image from VMware.)  Credit where it is due, I learned about this from excellent blog post here:  https://www.provirtualzone.com/vsphere-7-update-2-loses-connection-with-sd-cards-workaround/#comment-512 Essentially you can most likely regain control of your host and return it to normal responsiveness by issuing esxcfg-rescan -d vmhba32 one or more times (wait a few minutes between) until the command completes without error.  This removes the dead paths to the vmhba32 device.  Then issue esxcfg-rescan vmhba32 command and the filesystem on your SD card should be rediscovered;  verify by running ls -n and make sure nothing appears in red. - your host should now be back to normal responsiveness.  You may have to restart management agents:  /etc/init.d/hostd restart    and   /etc/init.d/vpxa restart .  Evacuate your host of running vm's and proceed to mitigation options below, then reboot the host. I've also seen it referenced that stopping the USB Arbitrator service can have a positive impact in resolving this issue (obvs this removes ability to map host USB to vm's, so...)   Known mitigation options at this time include: 1 - Roll back to prior release if you can to avoid the issue altogether until it is fully understood and resolved. 2 - make sure your scratch is pointed to a high-endurance media like the DAS connected to your Perc or your PCIe SSD cards (you should be doing this anyway if using SD or USB boot media, regardless of ESXi version) 3 - enable  /UserVars/ToolsRamdisk to minimize I/O related to host-based VMTools actions from hitting your SD card. This seems to make this issue re-occur less frequently and in some cases not re-occur at all...yet.  But it's usefulness as a long-term solution are not proven at this time. 4 - Stop USB Arbitrator service after boot if you do not need to pass through host USB to vm's:  /etc/init.d/usbarbitrator stop Or, you can wait for VMware to release a specific fix for this issue in the form of a new vmkusb vib, Some have reported getting a new debugging vmkusb driver 0.1-2vmw.702.0.20.45179358 from VMware  but I have not had luck getting this from GSS to date. Or wait for U3 sometime in August which *should* contain the fix for this issue.   
Issue #1 References:
https://www.provirtualzone.com/vsphere-7-update-2-loses-connection-with-sd-cards-workaround/#comment-512
https://www.dell.com/community/PowerEdge-Hardware-General/VMware-7-0-U2-losing-contact-with-SD-card/m-p/7851181/highlight/true#M68996
https://www.reddit.com/r/vmware/comments/napgvr/fyi_vmkusb_is_buggy_in_7x_local_storage_failure/
https://communities.vmware.com/t5/ESXi-Discussions/Issues-with-vmkusb-on-7-0U1c-with-SD-Card-boot/m-p/2825030

Issue #2 References:
https://kb.vmware.com/s/article/83376 - VMFS-L Locker partition corruption on SD cards in ESXi 7.0
https://kb.vmware.com/s/article/2149257 - High frequency of read operations on VMware Tools image may cause SD card corruption

Background references for ESXi 7.x boot device storage changes you should be aware of:
https://blogs.vmware.com/vsphere/2020/05/vsphere-7-esxi-system-storage-changes.html
https://blogs.vmware.com/vsphere/2020/07/vsphere-7-system-storage-when-upgrading.html
https://kb.vmware.com/s/article/2145210 - vSphere SSD and Flash Device Support
https://kb.vmware.com/s/article/2004784 - Installing ESXi on a supported USB flash drive or SD flash card
@MattGoddard  I know you have since started booting from SSD and the issue is no longer present... but if you want to experiment with this issue, the cause of your "Bootbank cannot be found at path /bootbank" error and the extreme latency/hanging is likely an APD to your USB boot media, which in turn could be due either to a hardware/firmware issue OR to media corruption. As you already identified, this is 100% because your boot media is USB.

Your assumption that "after booting from a flash drive, I don't think ESXi uses it at all" is no longer correct by default for vSphere 7, which uses a modified boot partition layout - and unless you manually redirect coredump and scratch to a high-endurance storage device, you are effectively overwhelming the I/O capabilities of your boot device by streaming your system logs to it during normal operations; this sustained high data rate will eventually corrupt the boot media. In ESXi 7, the small and large core dumps, locker, and scratch are now located on a new "ESX-OSData" partition on your boot device (provided it is large enough), which is formatted as VMFS-L. Booting from a non-high-endurance device like USB, SD card, etc. is still a supported method of running ESXi, but you MUST create a core dump file on a datastore backed by high-endurance media, and also assign scratch to a directory on a datastore backed by high-endurance media such as a local HDD or SSD, a shared-storage datastore, etc.

I would encourage you to read Niels Hagoort's excellent blog posts on the subject here: https://blogs.vmware.com/vsphere/2020/05/vsphere-7-esxi-system-storage-changes.html and https://blogs.vmware.com/vsphere/2020/07/vsphere-7-system-storage-when-upgrading.html and also read the following KB article, which describes the risk of boot media corruption: https://kb.vmware.com/s/article/83376

Sadly, I feel this is a ticking time bomb for many who upgrade to vSphere 7 without knowledge of the boot device formatting changes. The recommended boot device is now an HDD or SSD. Too bad for me and a lot of vSphere admins who spent considerable time over the last 5 years building diskless host environments and getting rid of all our spinning rust.
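A rough sketch of the two redirects I mean, from the ESXi shell (datastore and host names are placeholders; see the KBs above for the full procedure):

# Create and activate a coredump file on a high-endurance datastore
esxcli system coredump file add -d DATASTORENAME -f HOSTNAME-coredump
esxcli system coredump file set --smart --enable true
# Redirect scratch to a directory on the same datastore (takes effect after reboot)
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/DATASTORENAME/.locker-HOSTNAME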
My issue has been escalated to VMware Engineering - I will update here when I have news to share.
@continuum Thank you for your response. I have just retried per the steps you gave, and no luck. I created a NEW directory in the same shared-storage location as the .locker locations for all my other hosts, /vmfs/volumes/5axxxx0b-6cxxxx4e-8f5b-daxxxxxxx018/.locker-TESTING, and added it to the ScratchConfig.ConfiguredScratchLocation advanced setting via the web interface; waited a few seconds.

Checked /etc/vmware/locker.conf and the change IS reflected there - only one line, which is my specified location value plus a <space> zero appended:

/vmfs/volumes/5axxxx0b-6cxxxx4e-8f5b-daxxxxxxx018/.locker-TESTING 0

Ran /sbin/auto-backup.sh:

Files /etc/vmware/dvsdata.db and /tmp/auto-backup.2225731//etc/vmware/dvsdata.db differ
Saving current state in /bootbank
Clock updated.
Time: 14:21:08   Date: 02/12/2021 UTC

After reboot, the host shows ScratchConfig.CurrentScratchLocation as /tmp/_osdatai6lvxssn, NOT the /vmfs/volumes/5axxxx0b-6cxxxx4e-8f5b-daxxxxxxx018/.locker-TESTING that I specified. I extracted state.tgz from the bootbank and it shows:

/vmfs/volumes/5axxxx0b-6cxxxx4e-8f5b-daxxxxxxx018/.locker-TESTING 0

Additional info that may be helpful - this is a diskless system booting off a microSD card. The two bootbank partitions appear to be located on the SD device as expected and as such are persistent; obviously we do not want high-volume writes (scratch) going to the SD card, but rather to a high-endurance storage device like a shared datastore on the SAN. There is a new KB https://kb.vmware.com/s/article/2149444 where the bootbanks may be placed in /tmp/ after reboot due to storage-path-claim service issues, but I am not experiencing issues with the bootbanks, only the scratch location - for some reason this system just continues to place my scratch location in the /tmp ramdisk instead of any persistent location that I specify.

[root@Lxxxxxxxx:~] cd bootbank
[root@Lxxxxxxxx:/vmfs/volumes/601c4757-f9272e20-3e4f-7a94036000d0] cd ..
[root@Lxxxxxxxx:~] cd altbootbank
[root@Lxxxxxxxx:/vmfs/volumes/e28ed3b2-ac4c1b5a-9ab3-0188fa271c64] cd ..
[root@Lxxxxxxxx:~] cd /scratch
[root@Lxxxxxxxx:/tmp/_osdatai6lvxssn]

Any other thoughts or suggestions? It accepts my setting, it saves my setting, it just does not actually SET my setting, and I'm flummoxed. I do have an SR created and am going through the standard "send us your logs" hoops now.
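For what it's worth, this is roughly how I pulled the saved value out of the bootbank to confirm it had been persisted - a sketch assuming the usual state.tgz / local.tgz nesting:

# Unpack the saved configuration from the bootbank into a temporary directory
mkdir /tmp/statecheck && cd /tmp/statecheck
tar -xzf /bootbank/state.tgz
tar -xzf local.tgz
# The configured scratch location should appear here
cat etc/vmware/locker.conf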
+1 for "still an issue"  I just upgraded a ton of HPE diskless servers from 6.7U3 to 7.0.1 and no matter HOW I try to set the scratch config on SOME of them it reverts upon reboot. These are all iden... See more...
+1 for "still an issue"  I just upgraded a ton of HPE diskless servers from 6.7U3 to 7.0.1 and no matter HOW I try to set the scratch config on SOME of them it reverts upon reboot. These are all identical h/w servers, installed in an identical way, upgraded via identical methods (image profile) with identical versions of the HPE custom installer - and some of them will persist the setting across reboots and others will not.  Maddening!  I've tried all the known resolutions - upgraded ELX drivers, disabled/uninstalled ELX drivers, set scratch in the html GUI, set it via Set-VMHostAdvancedConfiguration (now deprecated) set it via Get-AdvancedSetting, made sure bootbanks were synchronized with /sbin/auto-backup.sh prior to rebooting,-read all the related KB's, changed the value when the host is in maint mode and NOT in maint mode, deleted and recreated the destination scratch folder, pointed to other folders entirely -and yet *nothing* works to resolve this for the random servers that will NOT persist this setting across reboots.  For some of the upgraded servers, every time I reboot my advanced settings show: ScratchConfig.ConfiguredScratchLocation     /vmfs/volumes/5axxxx0b-6cxxxx4e-8f5b-daxxxxxxx018/.locker-hostname ScratchConfig.CurrentScratchLocation     /tmp/_osdata4ivqeeqn  (effectively /tmp/scratch for those of you still on 6.x) I'm at my wits end with this issue that has been present in so many ESXi releases.  How can you write a value to the configuration of a host, have that command respond that it's written successfully, but then this written change is GONE on reboot as if you had never written it? How can 0's and 1's be this non-deterministic?
We're starting to get into "when you're digging yourself into a hole - quit digging" territory. Of course there are CLI methods to update your hosts via offline-bundle ZIP files, but without the benefit of VCSA running you have no easy way to clear off your hosts before performing the host upgrade. I would strongly recommend contacting VMware Support and working through the VCSA issues to get it back online. I'm sorry this did not go smoothly for you. I also wish there was better QA happening for the ESXi releases. IMO, an exponential increase in health event logging is something that should have been caught prior to publicly releasing 6.7 U3.
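For reference, the kind of offline-bundle CLI update I'm referring to, once you can evacuate a host - a sketch with placeholder depot path and profile name:

# Enter maintenance mode, apply the offline bundle, then reboot
esxcli system maintenanceMode set --enable true
esxcli software profile update -d /vmfs/volumes/DATASTORENAME/ESXi-OFFLINE-BUNDLE.zip -p PROFILE-NAME
reboot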
If you have already extended the SEAT disk on the VCSA, my opinion would be that you can just start upgrading hosts, and the volume of events coming into your VCSA SEAT disk will decrease proportionally. It really depends on the size of your environment as to how fast that SEAT disk capacity gets chewed up. I'm not 100% sure, but I would assume that after your hosts are upgraded, the spurious events already taking up space in your SEAT disk would eventually age out and the space would be reclaimed, obviating the need for manual truncation.

I ran into this issue in early September, then extended my SEAT disk once by 30 GB to get more free space, hoping to ride it out until a patch was released - but I still ended up having to truncate the tables every 4-6 days for weeks, because the 15018012 patch with the fix was not released until 11/12/2019 and the daily volume of SEAT data coming in chewed through the newly available space rather quickly. So truncation was basically a weekly task for me while waiting for the patch release.
You will need to follow the steps in the KB to recover space on the SEAT disk on your VCSA. This will allow the vpxd service to start and stay running. As I said, this process involves directly truncating tables in the Postgres DB on the VCSA, so proceed with extreme caution and follow the KB instructions explicitly, or get VMware Support to help you with this if you're not comfortable. The usual "have a backup" caveats apply.

Then you will need to upgrade ALL of your build 14320388 hosts to either 15018017 or 15160138 to resolve the issue of them generating spurious health alert messages to your VCSA - which is what is filling up the SEAT disk and causing vpxd to crash. vpxd will not run if the SEAT disk is >= 95% full. Incidentally, don't worry about the /storage/archive mount showing 100% full - that is an expected and desired status.

You can easily see the spurious messages on each host if you look in the vSphere Client at <select a host> and in the right pane select Monitor >> Events. You will see a large number of health events happening continuously on every build 14320388 host.
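To gauge where you stand before and after the cleanup, a quick sketch from the VCSA appliance shell (not a substitute for the KB steps):

# SEAT partition usage - vpxd will not stay up once this reaches ~95%
df -h /storage/seat
# Check whether vpxd is currently running
service-control --status vmware-vpxd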
If you are using any 57712/57712MF/57800/57800MF/57810/57810MF/57840/57840MF 10G Ethernet-based chipset from Broadcom -> QLogic -> Cavium -> Marvell under ESXi 6.7, you should be using either qfle3 driver v1.0.87.0 or v1.0.86.0 (although I will mention that 1.0.86 is the latest listed in the HCL), as well as the latest firmware. The value you are looking for in the VMware HCL is the MFW value; 7.15.56 is a slightly older version - the latest is 7.15.68, downloadable from your hardware vendor. The link for HPE is: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_55d72ecfd98540b5b9aac0bcec#tab3

I don't actually know the precise answer to your question of what is Storm vs. MFW, but I found this in the driver Release Notes:

...
ESXi 6.7/6.5:  Version 1.0.81.0 (Mar 27, 2019)
Internal FW: 7.13.11.0
Enhancements:
----------
1.  Pulled in Storm firmware version 7.13.11.0
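To see what a given host is actually running, a quick sketch from the ESXi shell (the vmnic number is just an example):

# Driver and firmware (MFW) versions reported for an uplink
esxcli network nic get -n vmnic0
# Installed qfle3 driver VIB version
esxcli software vib list | grep -i qfle3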
@DanPaLewis There is a bug in 6.7 U3 that causes a flood of host hardware health errors. The reason you get the "503 Service unavailable" message is that these event messages caused your SEAT disk to grow to 95% full - you will need to perform the remediation steps in KB74607, which describe how to truncate event tables on your VCSA in order to free disk space on the SEAT disk, and follow the other recommendations in that KB. Needless to say, this is a delicate operation. Engage VMware Support if you have questions about how to proceed.

Then, after you resolve the SEAT disk space issue, update your vCenter to the latest release, as well as updating your ESXi 6.7 U3 hosts to either:

ESXi 6.7 U3a November 2019 Patch ESXi670-201911001 2019-11-12 build 15018017
ESXi 6.7 U3b December 2019 Patch ESXi670-201912001 2019-12-05 build 15160138

Good luck!
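Once patched, a quick way to confirm each host is actually on one of the fixed builds listed above (a sketch):

# Should report build 15018017 or 15160138 after the update
vmware -vl
esxcli system version get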
Although the "Sensor System Chassis 1 UID" still shows as Unknown Status in the Hardware Health monitor for me after upgrade, it appears as though the 'Sensor -1 health events flooding the logs' ... See more...
Although the "Sensor System Chassis 1 UID" still shows as Unknown Status in the Hardware Health monitor for me after upgrade, it appears as though the 'Sensor -1 health events flooding the logs' issue is resolved in patch ESXi-6.7.0-20191104001 released last night. "After upgrading to ESXi 6.7 Update 3, you might see Sensor -1 type hardware health alarms on ESXi hosts being triggered without an actual problem. This can result in excessive email alerts if you have configured email notifications for hardware sensor state alarms in your vCenter Server system. These mails might cause storage issues in the vCenter Server database if the Stats, Events, Alarms and Tasks (SEAT) directory goes above the 95% threshold." Testing now... Cheers, Patrick
Although the "Sensor System Chassis 1 UID" still shows as Unknown Status in the Hardware Health monitor for me after upgrade, it appears as though the 'Sensor -1 health events flooding the logs' ... See more...
Although the "Sensor System Chassis 1 UID" still shows as Unknown Status in the Hardware Health monitor for me after upgrade, it appears as though the 'Sensor -1 health events flooding the logs' issue is resolved in patch ESXi-6.7.0-20191104001 released last night:  VMware ESXi 6.7, Patch Release ESXi670-201911001 "After upgrading to ESXi 6.7 Update 3, you might see Sensor -1 type hardware health alarms on ESXi hosts being triggered without an actual problem. This can result in excessive email alerts if you have configured email notifications for hardware sensor state alarms in your vCenter Server system. These mails might cause storage issues in the vCenter Server database if the Stats, Events, Alarms and Tasks (SEAT) directory goes above the 95% threshold." Testing now... Cheers, Patrick