I just patched my Dell FC640 blade servers from 7.0 U1d to 7.0 U2c. This morning I was looking at the events in vCenter and saw this:
09/15/2021, 11:09:18 PM Device mpx.vmhba32:C0:T0:L0 has been removed or is permanently inaccessible. Affected datastores (if any): Unknown.
09/15/2021, 11:09:18 PM Permanently inaccessible device mpx.vmhba32:C0:T0:L0 has no more opens. It is now safe to unmount datastores (if any) Unknown and delete the device.
When I check the storage devices, vmhba32 is up and running and there are no issues with the VMs. We boot off dual SD cards (we're looking into replacing these). My much older cluster has been running on 7.0 U1 without a single issue, and this cluster didn't have a problem on 7.0 U1 either. I just noticed this two days after we upgraded.
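For anyone wanting to check the same thing from the host shell instead of vCenter, the device state can be queried per device with esxcli. A sketch only; the guard makes it a no-op outside an ESXi shell, and the device ID is the one from the event above:

```shell
# Sketch: confirm the device state from the ESXi shell.
# Guarded so the snippet is harmless when run anywhere else.
if command -v esxcli >/dev/null 2>&1; then
  # Shows Status (e.g. "on" vs "dead") and other details for this device
  esxcli storage core device list -d mpx.vmhba32:C0:T0:L0
else
  echo "esxcli not found; run this on an ESXi host"
fi
```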
We have done all the mitigations: tools to ramdisk, coredump to disk, and scratch to persistent disk.
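For anyone following along, those three mitigations can be applied from the ESXi shell. This is a hedged sketch: the datastore path is a placeholder for your environment, and the guard makes it a no-op off-host:

```shell
# Sketch of the three SD-card write mitigations via esxcli (run per host).
# The datastore path is a placeholder; adjust for your environment.
if command -v esxcli >/dev/null 2>&1; then
  # 1. VMware Tools ISO images served from a ramdisk instead of boot media
  esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1
  # 2. Coredump redirected to a file on persistent storage
  esxcli system coredump file add
  esxcli system coredump file set --smart --enable true
  # 3. Scratch relocated to a persistent datastore (takes effect after reboot)
  esxcli system settings advanced set -o /ScratchConfig/ConfiguredScratchLocation \
    -s /vmfs/volumes/datastore1/.locker-hostname
else
  echo "esxcli not found; run this on an ESXi host"
fi
```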
What VMware log can I search to see if there are more events? The vCenter events view only shows the last 100.
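To partly answer my own question: permanent-device-loss messages like the ones above land in the host's own logs, /var/log/vobd.log and /var/log/vmkernel.log. A minimal sketch of the search, demonstrated against a sample line so it is self-contained (the log paths apply on a live host):

```shell
# Sketch: search ESXi host logs for permanent-device-loss events.
# On a live host you would grep /var/log/vobd.log and /var/log/vmkernel.log;
# here we write one sample line so the snippet runs anywhere.
cat > /tmp/vobd.sample.log <<'EOF'
2021-09-15T23:09:18.000Z: [scsiCorrelator] Device mpx.vmhba32:C0:T0:L0 has been removed or is permanently inaccessible.
EOF
# Count matching events across the log
grep -c 'permanently inaccessible' /tmp/vobd.sample.log
```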
EDIT: It looks like the iDRAC is now showing only 1 SD card on this node.
There are myriad posts on this forum and others about issues with 7.0U2 and SD-card or USB-based boot devices, caused by the unthrottling of I/O writes to boot media introduced in U2 (which explains why you had no issues with U1), among other factors. These issues should have been resolved in the latest patch release, so I'm not sure why you are seeing this if you are on 7.0U2c build 18426014. You should ensure that your iDRAC is on the latest version AND that the IDSDM is running the latest firmware, and additionally consider replacing the IDSDM boot devices with a BOSS card or other high-endurance media, per Dell's recommendation for ESXi 7.x hosts here: https://www.dell.com/support/manuals/en-us/vmware-esxi-7.x/vmware_esxi_7.0_gsg/getting-started-with-...
The dual SD card firmware was at 1.13, which I patched when I did the 6.7 to 7.0 U1 upgrade. Dell just released 1.15, which I upgraded to before moving to 7.0U2c. We also patched the iDRAC to 5.0 a few days ago. Every firmware component on these blades is up to date.
We are looking to replace these with SSDs, as BOSS cards are very hard to get hold of due to the chip shortages. I also see VMware just released the 7.0U2d update. I'm not holding my breath.
@mbartle Sorry you are continuing to see issues. From a prior gig (a 100% Dell compute shop) I am quite familiar with the 11G/12G IDSDM device, and I assume the 13G/14G iteration would be similar. I remember them being fairly stable, with the exception of a handful of times when one of the two SD cards would inexplicably physically eject itself and I would have to re-click it into position and re-mirror the pair. I have to assume they were not fully seated from the factory and slight vibrations over time worked them loose.

I currently work in a 100% HPE compute shop that is 100% virtualized; all servers are diskless with a single microSD in a socket on the motherboard, both Gen9 and Gen10 models. This is of course only anecdotal, but I upgraded more than 50 servers from 7.0U2a to 7.0U2c on the day the patch was released and have not experienced any issues at all since then on any server running 7.0U2c. I'm satisfied enough with the boot device stability at this point that I am going to resume my upgrade cadence for the remaining servers on 6.7.

Perhaps someone else with current Dell hardware using the IDSDM as the boot device can chime in here with their experience with 7.0U2c?
I would not expect 7.0U2d to provide any additional stability in this regard beyond what 7.0U2c provides, since the only fix-list item in the Release Notes is: PR 2824750: ESXi hosts in a cluster on Dell EMC PowerFlex might intermittently fail with a purple diagnostic screen due to a PCPU preemption error
The management of persistent memory in a cluster on Dell EMC PowerFlex with NVMe drives added as RDM devices might inconsistently update a PCPU preemption counter in ESXi hosts in the cluster. As a result, ESXi hosts might intermittently fail with a purple diagnostic screen.
Well, 7.0U2d did nothing to help. Now I get to spend my Friday and Saturday rebuilding these hosts back to 7.0 U1d until we can get BOSS cards. I'm not even going to bother with my other cluster for now.
What a disaster. I've been using VMware products since the 2.x days. Really sad to see them fall this hard.