VMware Cloud Community
Cyril2021
Contributor

ESXi 7.0.3 PSOD

24 hours after installing 7.0.3 (on vCenter and on 1 of the 3 hosts), the upgraded host crashed with a PSOD.

[code]

2021-10-07T12:42:05.306Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device naa.600508b1001c472cd365969f181995a9
2021-10-07T12:47:05.302Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
2021-10-07T12:47:05.302Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device naa.600508b1001c472cd365969f181995a9
2021-10-07T12:52:05.297Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
2021-10-07T12:52:05.298Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device naa.600508b1001c472cd365969f181995a9
2021-10-07T12:54:54.503Z cpu10:2097207)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0x85 (0x45d8d177da48, 2099565) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed:
2021-10-07T12:54:54.503Z cpu10:2097207)NMP: nmp_ThrottleLogForDevice:3875: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE. cmdId.initiator=0x4305de73bf00 CmdSN 0x10d
2021-10-07T12:54:54.503Z cpu10:2097207)ScsiDeviceIO: 4161: Cmd(0x45d8d177da48) 0x85, CmdSN 0x10d from world 2099565 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0
2021-10-07T12:54:54.503Z cpu10:2097207)ScsiDeviceIO: 4161: Cmd(0x45d8d177da48) 0x85, CmdSN 0x10e from world 2099565 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0
2021-10-07T12:55:35.293Z cpu3:2097677)NMP: nmp_ResetDeviceLogThrottling:3782: last error status from device mpx.vmhba32:C0:T0:L0 repeated 1 times
2021-10-07T12:57:05.293Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
2021-10-07T12:57:05.294Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device naa.600508b1001c472cd365969f181995a9
2021-10-07T13:02:05.288Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
2021-10-07T13:02:05.289Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device naa.600508b1001c472cd365969f181995a9
2021-10-07T13:07:05.284Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
2021-10-07T13:07:05.287Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device naa.600508b1001c472cd365969f181995a9
2021-10-07T13:12:05.282Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
2021-10-07T13:12:05.283Z cpu18:2097364)StorageDevice: 7059: End path evaluation for device naa.600508b1001c472cd365969f181995a9
2021-10-07T13:17:05.276Z cpu12:2097364)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
2021-10-07T13:17:05.276Z cpu12:2097364)StorageDevice: 7059: End path evaluation for device naa.600508b1001c472cd365969f181995a9
2021-10-07T13:21:53.335Z cpu3:2100702)FSS: 7377: Failed to open file 'hpilo-d0ccb0'; Requested flags 0x5, world: 2100702 [sfcb-smx], (Existing flags 0x5, world: 2102177 [sfcb-smx]): Busy
2021-10-07T13:21:53.436Z cpu3:2100702)FSS: 7377: Failed to open file 'hpilo-d0ccb0'; Requested flags 0x5, world: 2100702 [sfcb-smx], (Existing flags 0x5, world: 2102177 [sfcb-smx]): Busy
2021-10-07T13:22:05.273Z cpu19:2097364)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
2021-10-07T13:22:05.274Z cpu19:2097364)StorageDevice: 7059: End path evaluation for device naa.600508b1001c472cd365969f181995a9
2021-10-07T13:22:28.917Z cpu11:2103235)VSCSI: 2973: handle 9033325545005873(GID:9009)(vscsi0:1):Added handle (refCnt = 3) to vscsiResetHandleList vscsiResetHandleCount = 1
2021-10-07T13:22:28.917Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192707
2021-10-07T13:22:28.918Z cpu10:2097352)VSCSI: 3335: handle 9033325545005873(GID:9009)(vscsi0:1):Reset [Retries: 0/0] from (vmm0:MTUSRV3)
2021-10-07T13:22:29.420Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:29.922Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:30.422Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:30.922Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:31.422Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:31.924Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:32.426Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:32.566Z cpu11:2103235)WARNING: VSCSI: 3967: handle 9033325545005873(GID:9009)(vscsi0:1):WaitForCIF: Issuing reset; number of CIF:2
2021-10-07T13:22:32.566Z cpu11:2103235)WARNING: VSCSI: 2986: handle 9033325545005873(GID:9009)(vscsi0:1):Ignoring double reset
2021-10-07T13:22:32.928Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:33.430Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:33.930Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:34.430Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:34.932Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:35.271Z cpu11:2097651)WARNING: Heartbeat: 827: PCPU 16 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up.
2021-10-07T13:22:35.271Z cpu16:2342869)ALERT: NMI: 689: NMI IPI: RIPOFF(base):RBP:CS [0x10d913b(0x42001d600000):0x43128599fa30:0xf48] (Src 0x1, CPU16)
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1b7f0:[0x42001e6d913a]J6_NewOnDiskTxn@esx#nover+0x177 stack: 0x43128734ff50
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1b850:[0x42001e6d967d]J6CommitInMemTxn@esx#nover+0x176 stack: 0x1
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1b900:[0x42001e6d618a]J6_CommitMemTransaction@esx#nover+0xe3 stack: 0xec500000035
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1b950:[0x42001e6fbad4]Fil6_UnmapTxn@esx#nover+0x4fd stack: 0x0
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1ba60:[0x42001e6ff891]Fil6UpdateBlocks@esx#nover+0x4e2 stack: 0xff
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1bae0:[0x42001e6bf3fe]Fil3UpdateBlocks@esx#nover+0xeb stack: 0x8737fa00
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1bbe0:[0x42001e6c0425]Fil3_PunchFileHoleWithRetry@esx#nover+0x7e stack: 0x45389ef1bd70
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1bc90:[0x42001e6c0c0d]Fil3_FileBlockUnmap@esx#nover+0x57e stack: 0x43128585b250
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1bd40:[0x42001d63b5fb]FSSVec_FileBlockUnmap@vmkernel#nover+0x20 stack: 0xd3d7d2bcdeb8
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1bd50:[0x42001eb28f96]CBT_Ioctl@(cbt)#<None>+0xd3 stack: 0x5e298880
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1bdf0:[0x42001d65f912]DevFSFileBlockUnmap@vmkernel#nover+0x24f stack: 0x4306224157f0
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1be90:[0x42001d63b5fb]FSSVec_FileBlockUnmap@vmkernel#nover+0x20 stack: 0x430a2ac00ff0
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1bea0:[0x42001db52c03]VSCSI_ExecFSSUnmap@vmkernel#nover+0x9c stack: 0x430a2ac2ad40
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1bf10:[0x42001db50ead]VSCSIDoEmulHelperIO@vmkernel#nover+0x2a stack: 0x4300b8201220
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1bf40:[0x42001d6d9c19]HelperQueueFunc@vmkernel#nover+0x1d2 stack: 0x45389ef20b48
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1bfe0:[0x42001d9b1775]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
2021-10-07T13:22:35.271Z cpu16:2342869)0x45389ef1c000:[0x42001d6c46ff]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2021-10-07T13:22:35.434Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:35.934Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:36.436Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:36.938Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:37.440Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:37.940Z cpu10:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:38.442Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:38.943Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:39.443Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:39.943Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:40.445Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:40.947Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:41.449Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:41.951Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:42.452Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:42.955Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:43.455Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:43.957Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:44.459Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:44.960Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:45.460Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:45.960Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:46.462Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:46.962Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:47.464Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:47.964Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:48.466Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706
2021-10-07T13:22:48.966Z cpu14:2097352)VSCSI: 3226: handle 9033325545005873(GID:9009)(vscsi0:1):processing reset for handle ... state 1381192706

[/code]

Any idea what this could be?
Running on an HP DL380 Gen9 server with 24 disks in RAID 10 ADM.

vmwLordriki
Contributor

Hello sir, 

 

I just came here to tell you that I have the exact same problem. I started to get PSODs about 7 days after updating my hosts to 7.0.3, and now they hit at random. Every morning a different host will PSOD.

After the first crashes I proceeded to update all firmware/BIOS.

I have more or less ruled out the hardware, since I get the same crashes on my Lenovo SR630 and Dell R640 servers.

Here is a screenshot of the PSOD (attached).

I have opened a case with VMware; hopefully they will come up with a solution. I don't want to roll back to 7.0.2, but my virtualization infrastructure was rock solid before.

Cyril2021
Contributor

Initially I had VMware installed on the SD card inside the server. Since that is no longer recommended after 7.0.1, I have installed it on an HDD. I did have 3 more PSODs last Friday; the message was slightly different:

"pqi_is_firmware_feature_supported:0019: Invalid byte index for bit position 16" (screenshot added)

After that, also last Friday, I set up a new server with ESXi 6.7 and moved the VMs to it. No crashes so far.

Glad I am not the only one with this issue. The 2 other hosts are still running 7.0.3 and have not crashed yet.

 

e_espinel
Virtuoso

Hello.

If you have servers running versions higher than 7.0.1, it is not recommended to use an SD card as the boot device; sooner or later you will have serious problems.

In my personal opinion, I would feel more at ease staying on version 6.7 with internal SAS disks.

Enrique Espinel
Senior Technical Support on IBM, Lenovo, Veeam Backup and VMware vSphere.
VSP-SV, VTSP-SV, VTSP-HCI, VTSP
Please mark my comment as Correct Answer or assign Kudos if my answer was helpful to you, Thank you.
Srijithk
Enthusiast

It looks like this issue is related to the disk type.

I changed the disk type of my VMs from thin provisioned to thick provisioned eager zeroed while performing a Storage vMotion, and the hosts haven't PSOD'd so far.

Note: it's a small test environment with very few VMs.

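In case it helps anyone, this is roughly how such a conversion can be done with PowerCLI; a minimal sketch, assuming placeholder vCenter, VM, and datastore names:

[code]
# Connect to vCenter first (the server name below is a placeholder)
Connect-VIServer -Server vcenter.example.local

# Storage vMotion the VM to another datastore, converting its disks
# from thin to eager-zeroed thick in the same operation
$vm = Get-VM -Name "VM-01"          # placeholder VM name
$ds = Get-Datastore -Name "DS-02"   # placeholder target datastore
Move-VM -VM $vm -Datastore $ds -DiskStorageFormat EagerZeroedThick
[/code]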
Cyril2021
Contributor


@e_espinel wrote:

Hello.

If you have servers running versions higher than 7.0.1, it is not recommended to use an SD card as the boot device; sooner or later you will have serious problems.

In my personal opinion, I would feel more at ease staying on version 6.7 with internal SAS disks.


Yes, that's why I installed it on the internal disks. But I still got a PSOD after that (it might be another issue, though, since that PSOD looked different).

Cyril2021
Contributor


@avensis wrote:

Hi, did you manage to fix the problem?



Not yet, no. Downgrading to 6.7 is not a solution, just a temporary workaround; so far that host has not crashed. I will create a support request.

vmwLordriki
Contributor

Hello Cyril2021

I just got off the phone with VMware. The L1 support engineer told me that other customers are also experiencing these crashes after upgrading to or installing 7.0.3.

 

Here is the follow-up email from VMware:

 

"Thanks for your time over the call.

As discussed, I have provided the information to my L2 and we will be engaging our engineering team for further investigation.
Will keep you posted once we get an update."

 

Fingers crossed for a quick resolution.

depping
Leadership

Do you happen to have the SR number? I would like to monitor the situation internally. Thanks!

vmwLordriki
Contributor

Hello depping,

 

Here is the SR # 21266399610

depping
Leadership

Thanks!

AndrewBorn
Contributor

We had not seen a PSOD in probably a year before this, and that PSOD was caused by faulty hardware that was replaced.


We updated from 6.5 to 7.0.3, starting on 8 October 2021, and the PSODs began on 13 October 2021. We narrowed the cause to hosts running Citrix VDI: we isolated Citrix VDI to a single cluster, and the PSODs happen only on those hosts, about once per day per host. HA and DRS are turned off on this cluster.

 

VMware has responded to our support ticket and I have read the KB article.

 

Crashes are not always associated with a power-on event, but they are much more likely to happen when a VM is powered on, and apparently more likely still when multiple VMs are powered on at the same time. As a workaround, we have been powering on Citrix VDI VMs on a host with no other running VMs, then migrating each VM to another host once it has booted.

 

We are working to change our VDI deployment from thin- to thick-provisioned disks in the hope that this stops the PSOD events.


We have hundreds of other VMs with thin-provisioned disks but have not seen any PSODs on other hosts. Hosts running VMs offered via Citrix Virtual Apps and Desktops have not crashed, and we have a large number of VMs offered that way. Only hosts running Citrix VDI appear to suffer this PSOD in our environment.


The Citrix VDI VMs are currently at virtual hardware version 13. That is true of most of our VMs, except for a testing pool that has been upgraded to version 19.


All servers and components have the most recent firmware.

Srijithk
Enthusiast

Andrew,

From what I have heard, this issue arises only when an UNMAP request from the guest OS exceeds a certain size.

So it may be that not every VM in your environment reaches that size; the ones that do are crashing their hosts.

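For anyone who wants to test that theory, in-guest UNMAP can be checked and temporarily switched off from inside a Windows guest; a minimal sketch, nothing VMware-specific, run from an elevated PowerShell prompt:

[code]
# Check whether the guest issues TRIM/UNMAP (0 = delete notifications enabled)
fsutil behavior query DisableDeleteNotify

# Temporarily stop the guest from sending UNMAP to its virtual disks
fsutil behavior set DisableDeleteNotify 1

# Re-enable once a fixed build is in place
fsutil behavior set DisableDeleteNotify 0
[/code]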
AndrewBorn
Contributor

Your post gave me an idea. Our Citrix VDI VMs use a dedicated set of VMFS 6 datastores, which attempt automatic space reclamation by default. I have disabled space reclamation on these datastores. We have been experiencing host PSODs every day, usually on more than one host per day, so I should be able to tell soon whether this resolves the problem.

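For anyone who prefers the command line over the vSphere Client for this, the same setting can be read and changed per datastore with esxcli; a minimal sketch, with a placeholder datastore name:

[code]
# Show the current space-reclamation (UNMAP) settings for a VMFS6 datastore
esxcli storage vmfs reclaim config get --volume-label=DS-01

# Disable automatic space reclamation on that datastore
esxcli storage vmfs reclaim config set --volume-label=DS-01 --reclaim-priority=none
[/code]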
Cyril2021
Contributor

Update 3 has been revoked. I think this topic can be closed; we'll have to wait for a new release that delivers the Update 3 fixes with the issues it introduced resolved.

Cyril2021
Contributor

I am going to install 7.0 Update 3c in 2 weeks (on the 21st). I am having doubts about whether to upgrade from Update 3 to 3c or do a clean install on all hosts. The latter will be very time-consuming but might be the safest way to go.

Did any of you do an update instead of a clean install?

depping
Leadership

I did an update from 3 to 3c, and it worked just fine. Then again, I run a 4-host lab, which of course cannot be compared to a production environment.

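For reference, on a standalone host the in-place route is typically an esxcli image-profile update from the offline depot; a minimal sketch, assuming the depot zip has been uploaded to a datastore (list the profiles first; the file and profile names below are illustrative):

[code]
# List the image profiles contained in the downloaded depot
esxcli software sources profile list -d /vmfs/volumes/datastore1/VMware-ESXi-7.0U3c-depot.zip

# Apply the profile reported by the command above, then reboot the host
esxcli software profile update -d /vmfs/volumes/datastore1/VMware-ESXi-7.0U3c-depot.zip -p ESXi-7.0U3c-19193900-standard
reboot
[/code]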
abshri
Contributor

Hi @depping, our host is already running ESXi 7.0 Update 3c and is freezing with the same log snippets. It's a VxRail environment.

SR # 22381981511
