MillardJK
Enthusiast

Issues with vmkusb on 7.0U1c with SD Card boot

Ever since upgrading my hosts to 7.0.1 b17325020, I've had issues with my boot-from-SD-card devices. In a nutshell, they disappear from the host at the adapter level (vmhba32 is no longer present in the storage device list), and only a reboot brings them back. On one host, I've replaced the media with new cards (time consuming, but easy when the hardware uses RAID1 across a pair of them) on the off-chance that the age of the media was the cause. This is a lab system, so I can't bring Support into the mix to figure it out, but I suspect something is misbehaving in the interaction between the latest vmkusb driver and the SD card module (the hosts are Dell R620s with the latest firmware/iDRAC/LCM), which shows as online & nominal in hardware management.

This all came to light when I tried to upgrade to b17325551 and discovered that the bootbank was missing, so the update(s) couldn't be applied. Rebooting and then running the update "mostly" worked: for some crazy reason, some pieces--like getting VMware Tools written to the device--weren't succeeding.

I'm wondering if anyone else is seeing this sort of thing. It may be that slow device performance is causing timeouts that weren't "fatal" in older versions of the driver (the media is all C10/U1 or C10/U3, so write performance is high for the class of media).

——
Jim Millard
Kansas City, MO USA
16 Replies
MillardJK
Enthusiast

More details: I've discovered that cycling the USB Arbitrator service brings the adapter back...

[root@esx3:~] esxcli storage core adapter list
HBA Name  Driver          Link State  UID                                     Capabilities         Description
--------  --------------  ----------  --------------------------------------  -------------------  -----------
vmhba0    vmw_ahci        link-n/a    sata.vmhba0                                                  (0000:00:1f.2) Intel Corporation Patsburg 6 Port SATA AHCI Controller
vmhba1    lsi_mr3         link-n/a    sas.5848f690e685cd00                                         (0000:02:00.0) Broadcom PERC H710P Mini (for monolithics)
vmhba2    intel-nvme-vmd  link-n/a    pscsi.vmhba2                                                 (0000:41:00.0) Intel Corporation NVM Express PCIe SSD DC P3600 AIC
vmhba3    intel-nvme-vmd  link-n/a    pscsi.vmhba3                                                 (0000:42:00.0) Intel Corporation NVM Express PCIe SSD DC P3600 AIC
vmhba64   qfle3i          online      iscsi.vmhba64                           Second Level Lun ID  QLogic 57800 10 Gigabit Ethernet Adapter
vmhba65   qfle3f          link-down   fcoe.200044a8420a3ced:200144a8420a3ced  Second Level Lun ID  () QLogic Inc. FCoE Adapter
vmhba66   qfle3i          online      iscsi.vmhba66                           Second Level Lun ID  QLogic 57800 10 Gigabit Ethernet Adapter
vmhba67   qfle3f          link-down   fcoe.200044a8420a3cef:200144a8420a3cef  Second Level Lun ID  () QLogic Inc. FCoE Adapter
[root@esx3:~] /etc/init.d/usbarbitrator stop
watchdog-usbarbitrator: Terminating watchdog process with PID 2102587
stopping usbarbitrator...
usbarbitrator stopped
[root@esx3:~] /etc/init.d/usbarbitrator start
usbarbitrator started
[root@esx3:~] esxcli storage core adapter list
HBA Name  Driver          Link State  UID                                     Capabilities         Description
--------  --------------  ----------  --------------------------------------  -------------------  -----------
vmhba0    vmw_ahci        link-n/a    sata.vmhba0                                                  (0000:00:1f.2) Intel Corporation Patsburg 6 Port SATA AHCI Controller
vmhba1    lsi_mr3         link-n/a    sas.5848f690e685cd00                                         (0000:02:00.0) Broadcom PERC H710P Mini (for monolithics)
vmhba2    intel-nvme-vmd  link-n/a    pscsi.vmhba2                                                 (0000:41:00.0) Intel Corporation NVM Express PCIe SSD DC P3600 AIC
vmhba3    intel-nvme-vmd  link-n/a    pscsi.vmhba3                                                 (0000:42:00.0) Intel Corporation NVM Express PCIe SSD DC P3600 AIC
vmhba32   vmkusb          link-n/a    usb.vmhba32                                                  () USB
vmhba64   qfle3i          online      iscsi.vmhba64                           Second Level Lun ID  QLogic 57800 10 Gigabit Ethernet Adapter
vmhba65   qfle3f          link-down   fcoe.200044a8420a3ced:200144a8420a3ced  Second Level Lun ID  () QLogic Inc. FCoE Adapter
vmhba66   qfle3i          online      iscsi.vmhba66                           Second Level Lun ID  QLogic 57800 10 Gigabit Ethernet Adapter
vmhba67   qfle3f          link-down   fcoe.200044a8420a3cef:200144a8420a3cef  Second Level Lun ID  () QLogic Inc. FCoE Adapter
[root@esx3:~]

Note how vmhba32 returns through this sequence. I've also noted that the BOOTBANK file links are responsive again, and previously failing dependent tasks (like running a configuration backup) now succeed. It's not perfect, because the adapter still disappears again after some amount of time... I'm going to stop the arbitrator on one of the hosts and leave it stopped to see if that helps any.

——
Jim Millard
Kansas City, MO USA
MillardJK
Enthusiast

More data: I've discovered that simply stopping the USB Arbitrator daemon is sufficient to bring the card reader back online (a stop/start cycle isn't needed). With three hosts to compare, I've also seen that leaving the Arbitrator daemon stopped seems to keep the device from going offline, while a stop/start cycle is no better than just leaving it running, as on a normal boot.

I'm starting to think the issue may be more related to the Arbitrator daemon than to the vmkusb driver itself. At least I know I can live without the USB arbitrator--I'm not trying to give any VMs access to host USB--while living without the SD card reader is more problematic.
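For anyone who wants to try the same workaround, here's a minimal sketch of what I mean (assuming, like me, you don't need USB passthrough for any VMs):

```shell
# stop the arbitrator for the current boot; this alone brought vmhba32 back for me
/etc/init.d/usbarbitrator stop

# keep it from starting again on subsequent reboots
chkconfig usbarbitrator off

# confirm the vmkusb adapter (vmhba32 on my hosts) is present again
esxcli storage core adapter list
```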

——
Jim Millard
Kansas City, MO USA
MillardJK
Enthusiast

After over 24 hours of running with the arbitrator disabled, I've had zero issues with the USB adapter.

——
Jim Millard
Kansas City, MO USA
PatrickDLong
Enthusiast

@MillardJK  I'm interested to hear whether leaving the USB Arbitrator service stopped continued to be effective for you over time. Many places I've seen reference this issue as the fault of the vmkusb driver exclusively, but your posts seem to indicate that may not be the case--or perhaps it's the interaction of the vmkusb driver with the USB Arbitrator service that causes it. Have you considered permanently disabling the USB Arbitrator service so it stays off after a reboot?

~ # chkconfig usbarbitrator off

MillardJK
Enthusiast

@PatrickDLong,

Yes, it absolutely made the problem disappear, permanently.

However, I started having reliability issues with the SD card media--it wouldn't disappear, but writes would fail, and the mirror (a dual-card module in the Dell R620) would drop one of its elements, forcing rebuilds. Even after replacing media (the slow process of swapping out & rebuilding the mirror pairs), I'd accumulated enough new issues that I started looking into other options.

I ended up settling on a SATA DOM module that was the OEM part for the R620, and between perceived higher reliability, not relying on the USB attachment, and actual performance improvements, I'm glad I made the switch.

——
Jim Millard
Kansas City, MO USA
PatrickDLong
Enthusiast

Probably not of much value to you at this point since you swapped to SATADOM, but I have read about the Dell IDSDM dual-card module having issues under 7.x just like any other USB-based boot device. My current environment is 100% HPE (all diskless), but my previous environment was 100% Dell and many hosts used this mirrored SD solution. I would like to know: had you redirected scratch to a location backed by high-endurance media, like your SAN? Or was scratch going to the IDSDM SD cards via the OSDATA VMFS-L partition? I'm seeing a lot of evidence of USB and SD media being damaged by too high a level of I/O to the device, which was likely your case. Another option to lower the I/O to the boot device is to enable /UserVars/ToolsRamdisk, which creates a ramdisk on boot and serves the host's VMware Tools from RAM rather than from the boot device; see https://kb.vmware.com/s/article/2149257 . I just wonder if you had performed either or both of those remediations and STILL had the failures--I think 7.x is generating more I/O to the boot device than VMware is letting on...
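For reference, that setting can be flipped from the ESXi shell; a sketch (the host needs a reboot afterward for the ramdisk to be created):

```shell
# enable the VMware Tools ramdisk so Tools images are served from RAM
# instead of the boot device (see KB 2149257); reboot to take effect
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1

# verify the current value
esxcli system settings advanced list -o /UserVars/ToolsRamdisk
```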

MillardJK
Enthusiast

I hadn't read about the IDSDM modules having issues, but I was redirecting /scratch even before I upgraded to the R620s (I had R610s with a single SD slot), because of the known issues with SD/USB boot that cause things like /scratch and core dumps to land on a RAM disk by default.

I had read about those other remediation options before I moved to the SATA DOM, and was already doing most of it (aside from moving Tools to a ramdisk).
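For anyone setting up the same thing, the /scratch redirect looks roughly like this from the ESXi shell (the datastore and folder names are just examples; the change takes effect after a reboot):

```shell
# create a per-host scratch folder on persistent storage
# ("datastore1" and ".locker-esx3" are example names)
mkdir /vmfs/volumes/datastore1/.locker-esx3

# point the host's configured scratch location at it; reboot to apply
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/datastore1/.locker-esx3
```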

——
Jim Millard
Kansas City, MO USA
JailBreak
Hot Shot

Here's a workaround if you need to recover an ESXi host:

https://www.provirtualzone.com/vsphere-7-update-2-loses-connection-with-sd-cards-workaround/

 

Luciano Patrão

vExpert vSAN, NSX, Cloud Provider, Veeam Vanguard
Solutions Architect - Tech Lead for VMware / Virtual Backups

________________________________
If helpful Please award points
Thank You
Blog: https://www.provirtualzone.com | Twitter: @Luciano_PT
JailBreak
Hot Shot

Hi Millard,

 

To fully test whether the issue will return, try upgrading VMware Tools on some VMs on the affected ESXi hosts; after 24h or so you will see whether the issue comes back.

That's something I understand triggers the issue.

MillardJK
Enthusiast

On my systems, I never had the host freeze or lose the ability to manage the VMs: I could evacuate the host & restart it; I just couldn't save any configuration changes or updates.

——
Jim Millard
Kansas City, MO USA
JailBreak
Hot Shot

So you are lucky 😉

Everyone I've discussed this issue with lost access to the ESXi host, with a huge impact on production.

Luciano Patrão

vExpert vSAN, NSX, Cloud Provider, Veeam Vanguard
Solutions Architect - Tech Lead for VMware / Virtual Backups

________________________________
If helpful Please award points
Thank You
Blog: https://www.provirtualzone.com | Twitter: @Luciano_PT
MillardJK
Enthusiast


To fully test whether the issue will return, try upgrading VMware Tools on some VMs on the affected ESXi hosts; after 24h or so you will see whether the issue comes back.

Interesting that you bring up Tools... Once the SD card was online again, everything seemed fine, but the underlying issue that persisted was with applying updates. Tools would upgrade just fine on the host, and VMs could be upgraded from it, but later on--even without the device going offline, because I was using my arbitrator workaround--VMs that had been showing "tools update available" would show "tools current."

Other updates--even the kernel build number--would show that the host had essentially rolled back. While this was annoying and would eventually "catch" after restarting & reapplying the updates a few times, I wasn't having other administrative symptoms, and other config changes (like vpxagent password updates) would persist just fine.

I finally got fed up and switched to SATA-DOM, and nothing has been a problem since.

——
Jim Millard
Kansas City, MO USA
JailBreak
Hot Shot

VMware Tools upgrades the VMs without any issue. We don't notice any problems during the upgrade itself, but since a RAM disk is used for the VMware Tools image, the upgrade triggers the issue on the host after some hours.

Even if the .locker is not on the SD cards, the issue happens anyway.

PatrickDLong
Enthusiast

OK @MillardJK, so it's slightly concerning that you were already doing everything "right", with the exception of the VMware Tools redirect to ramdisk, and STILL suffered corruption of your SD boot media. I mean, how much I/O could VMware Tools POSSIBLY generate, especially when you are not actively upgrading Tools on your VMs? This is why I'm suspicious that VMware is not being 100% forthright about the amount (and sources) of I/O going to the boot device in 7.x.

PatrickDLong
Enthusiast

@MillardJK  I really appreciate your persistence in replying to these posts despite having already resolved the issue in your environment. I agree that the best long-term solution is to move to high-endurance boot devices, however...  in a 200+ diskless host environment spread across two remote locations, the amount of parts expense and man-hours required to implement this kind of mass overhaul of our vSphere host architecture is immense.  I'm trying to find the best possible path forward with the existing boot device configuration and your responses are giving me important data points as I chart this path. So thank you!  And also thanks to @JailBreak for his valuable blog post!

As an aside, I have been using a single shared-storage location to host my VMtools files ever since VMTools was decoupled from specific host version compatibility and was announced as backward and forward compatible - I forget when that was but some time ago.  I used to use a rather simplistic method of replacing the symbolic link to productLocker location, i.e.

rm /productLocker

ln -s /vmfs/volumes/VOLUMENAMEHERE/SharedLocker /productLocker

ls -n

but deprecated commands in newer ESXi versions now have me doing this by pasting a more inelegant series of PowerShell commands:

$esxName = 'HOSTNAMEHERE'
$dsName = 'DATASTORENAMEHERE'
$dsFolder = 'SharedLocker'

$esx = Get-VMHost -Name $esxName
$ds = Get-Datastore -Name $dsName
$oldLocation = $esx.ExtensionData.QueryProductLockerLocation()
$location = "/$($ds.ExtensionData.Info.Url.TrimStart('ds:/'))$dsFolder"
$esx.ExtensionData.UpdateProductLockerLocation($location)

Write-Host "Tools repository moved from"
Write-Host $oldLocation
Write-Host "to"
Write-Host $location

 

I'm sure there's a cleaner way of doing this, but it works for me and I haven't had time to polish it. What I haven't been able to determine is whether this strategy completely removes boot-device I/O attributable to the host's VMware Tools location, with that I/O instead redirected to the shared storage location specified in my productLocker symbolic link. IMO this *should* effectively reduce my risk of I/O-induced boot device issues, along with other mitigations like redirecting scratch.
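If I remember correctly, newer ESXi builds also expose the productLocker path as an advanced setting, which would avoid both the symlink surgery and the PowerShell; a sketch, reusing my placeholder volume name from above:

```shell
# point the host's productLocker at shared storage
# (VOLUMENAMEHERE/SharedLocker as in the symlink example above)
esxcli system settings advanced set -o /UserVars/ProductLockerLocation -s /vmfs/volumes/VOLUMENAMEHERE/SharedLocker

# confirm the new value
esxcli system settings advanced list -o /UserVars/ProductLockerLocation
```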

JailBreak
Hot Shot

Hi Patrick,

I have created a couple of scripts to check those settings and to create them if needed.

The first one just checks the settings and displays the information, so I can see which hosts have already moved the scratch to a datastore and which are still using the default (in this case, the SD cards).

###### Credentials and vCenter connection ######
$user = "uservCenter"
$pwd = "password"
$vCenter = 'vCenterIP'

Connect-VIServer $vCenter -User $user -Password $pwd
### vCenter Server Name(FQNP)
$vCentName = [system.net.dns]::GetHostByAddress($vCenter).hostname
################################################
cls

$Cluster = Get-Cluster "addCluster"
$ESXiHosts = $Cluster | Get-VMHost | where {$_.PowerState -eq "PoweredOn"} | sort
#$esxihosts = Get-VMHost | where {$_.PowerState -eq "PoweredOn"} | sort

$Output=@()
$ConfiguredLocation = "ScratchConfig.ConfiguredScratchLocation"
$CurrentLocation = "ScratchConfig.CurrentScratchLocation"
$RamDisk = "UserVars.ToolsRamdisk"
$USBCoredump = "VMkernel.Boot.allowCoreDumpOnUsb"
$USBCoredumpPartition = "VMkernel.Boot.autoPartitionCreateUSBCoreDumpPartition"
$SysGlobal = "Syslog.global.logDir"


foreach($ESXi in $ESXiHosts ){

$value1 = Get-AdvancedSetting -Entity $ESXi -Name $ConfiguredLocation
$value2 = Get-AdvancedSetting -Entity $ESXi -Name $CurrentLocation
$value3 = Get-AdvancedSetting -Entity $ESXi -Name $RamDisk
$value4 = Get-AdvancedSetting -Entity $ESXi -Name $USBCoredump
$value5 = Get-AdvancedSetting -Entity $ESXi -Name $USBCoredumpPartition
$value6 = Get-AdvancedSetting -Entity $ESXi -Name $SysGlobal

If ($value3.Value -eq $true) { $value3 = "Enable" } elseif ($value3.Value -eq $false) { $value3 = "Disable" }

$tmp = [pscustomobject] @{
"ESXi host" = $ESXi;
"Configured Location" = $value1.Value;
"Current Location" = $value2.Value;
"UserVars ToolsRamdisk" = $value3;
"Allow Core Dump On USB" = $value4.Value;
"Auto USB Core Dump Partition" = $value5.value;
"Sys Global Log Folder" = $value6.value;
}

$Output+= $tmp

}

$Output | FT *
$ThisDate = Get-Date -format dd_MM
$Path=$PSScriptRoot
$kFile = "$Path\ScratchSettings_$vCentName-$ThisDate.csv"

$Output | Export-Csv $kFile -NoTypeInformation -UseCulture
$Output | Out-GridView

Afterwards I just need to add the datastore where I want to move the scratch, and the script will create the folder and update the setting.

Connect-VIServer $vCenter -User $user -Password $pwd
### vCenter Server Name(FQNP)
$vCentName = [system.net.dns]::GetHostByAddress($vCenter).hostname
################################################
cls

$Cluster = Get-Cluster "add-Cluster"
$dsName = 'add-Datastore'
$pathPrefix = '.locker_'
$ds = Get-Datastore -Name $dsName

New-PSDrive -Location $ds -Name DS -PSProvider VimDatastore -Root '' | Out-Null

$Cluster | Get-VMHost | ForEach-Object -Process {

$folder = "$($pathPrefix)$($_.Name)"
$folderPath = "/vmfs/volumes/$($ds.Name)/$folder"
New-Item -Path "DS:\$folder" -ItemType Directory | Out-Null
Get-AdvancedSetting -Entity $_ -Name "ScratchConfig.ConfiguredScratchLocation" |
Set-AdvancedSetting -Value $folderPath -Confirm:$false | Out-Null
Get-AdvancedSetting -Entity $_ -Name "Syslog.global.logDir" |
Set-AdvancedSetting -Value "[] /scratch/log" -Confirm:$false | Out-Null
}

Remove-PSDrive -Name DS -Confirm:$false

PS: I created the above from a script written by the guru @LucD.

Hope this helps you automate these changes.
