Re: Critical Error after HP EVA8000 upgrade to XCS...

Jae_Ellers · ‎06-11-2007

Will post more tomorrow, but since HP is probably working on getting this on a lot of systems now I wanted to post this.

During the firmware upgrade everything went great, with the SPs failing over and back. No issues, normal logs.

After the upgrade I rescanned and started getting these:

Jun 11 12:27:51 tekvm8 vmkernel: 11:20:24:47.862 cpu6:1038)WARNING:

LinSCSI: 4570: The physical media represented by vmhba1:7:3 has changed and

the device is in use. The device cannot be re-synchronized with the system.

This is a critical error.

Tried some things, VMware had me reboot (which worked, but sucked) and kept working at it. My SR was 189532051.

Long story short, if you clean the running VMs off of the host and rescan the FC adapters a couple times the issue is corrected. Whatever info ESX is caching or using to validate the lun is updated and vms can be moved back.

So, vmotion or maintenance mode, rescan all vmhba in order, repeat rescan, repoplulate. All is good. YMMV.

More details tomorrow.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=- http://blog.mr-vm.com http://www.vmprofessional.com -=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Jae_Ellers · ‎06-18-2007

VMware says regardless of HP's statement that this can be done live the luns should be unpresented during the upgrade. I had good luck putting the hosts in maintenance mode and rescanning. Here are my notes and what the logs looked like during this process.

The following info reviews log info that shows on ESX 3.x systems during and after a SAN failover event. This information is provided so admins may determine if the ESX hosts and storage is functioning correctly.

Nomal operation: San Failover

Here are logs on tekvm4 just before the SAN failed over on June 10, 2007 for the HP XCS 6.100 firmware update.

/var/log/vmkernel

Here the active path has become inaccessible

Jun 10 00:42:39 tekvm4 vmkernel: 1:12:22:28.838 cpu3:1032)SCSI: 3753: AsyncIO timeout (5000); aborting cmd w/ sn 265322, handle f63/0x5203430

Jun 10 00:42:39 tekvm4 vmkernel: 1:12:22:28.838 cpu3:1032)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x5203430, originSN 265322 from vmhba2:4:2

Jun 10 00:42:39 tekvm4 vmkernel: 1:12:22:28.838 cpu3:1032)<6>qla24xx_abort_command(1): handle to abort=2030
Jun 10 00:42:39 tekvm4 vmkernel: 1:12:22:28.838 cpu3:1032)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jun 10 00:42:39 tekvm4 vmkernel: 1:12:22:28.839 cpu3:1032)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jun 10 00:42:39 tekvm4 vmkernel: 1:12:22:28.839 cpu3:1032)SCSI: 3753: AsyncIO timeout (5000); aborting cmd w/ sn 139735, handle fa8/0x5203460

Here the path is being marked as broken and the new path being active:

Jun 10 00:43:13 tekvm4 vmkernel: 1:12:23:02.885 cpu7:1328)<6>scsi(1): RSCN database changed -0x3,0x400.
Jun 10 00:43:13 tekvm4 vmkernel: 1:12:23:02.885 cpu2:1055)scsi(1): Waiting for LIP to complete...
Jun 10 00:43:17 tekvm4 vmkernel: 1:12:23:07.499 cpu7:1324)<6>scsi(1): RSCN database changed -0x3,0x400.
Jun 10 00:43:17 tekvm4 vmkernel: 1:12:23:07.499 cpu7:1324)<6>scsi(1): RSCN database changed -0x3,0x800.
Jun 10 00:43:17 tekvm4 vmkernel: 1:12:23:07.499 cpu7:1324)<6>scsi(1): RSCN database changed -0x3,0xc00.
Jun 10 00:43:17 tekvm4 vmkernel: 1:12:23:07.499 cpu7:1324)<6>scsi(1): RSCN database changed -0x3,0x1000.
Jun 10 00:43:18 tekvm4 vmkernel: 1:12:23:08.088 cpu7:1180)<6>scsi(1): RSCN database changed -0x3,0x400.
Jun 10 00:43:18 tekvm4 vmkernel: 1:12:23:08.155 cpu7:1318)<6>scsi(1): RSCN database changed -0x3,0x400.
Jun 10 00:43:19 tekvm4 vmkernel: 1:12:23:09.143 cpu7:1074)<6>scsi(1): Port database changed.

Here some of the newly active paths are being overrun by requests and showing as busy

Jun 10 00:43:34 tekvm4 vmkernel: 1:12:23:23.483 cpu0:1038)SCSI: 8043: vmhba2:4:1:1 status = 8/0 0x0 0x0 0x0
Jun 10 00:43:34 tekvm4 vmkernel: 1:12:23:23.483 cpu0:1038)SCSI: 8062: vmhba2:4:1:1 Retry (busy)
Jun 10 00:43:34 tekvm4 vmkernel: 1:12:23:23.483 cpu4:1040)SCSI: 8043: vmhba2:4:1:1 status = 8/0 0x0 0x0 0x0
Jun 10 00:43:34 tekvm4 vmkernel: 1:12:23:23.483 cpu4:1040)SCSI: 8062: vmhba2:4:1:1 Retry (busy)
Jun 10 00:43:34 tekvm4 vmkernel: 1:12:23:23.534 cpu5:1038)SCSI: 8043: vmhba2:4:1:1 status = 2/0 0x6 0x29 0x0
Jun 10 00:43:34 tekvm4 vmkernel: 1:12:23:23.534 cpu5:1038)SCSI: 8074: vmhba2:4:1:1 Retry (unit attn)

After a few seconds of this the commands are aborted or completed and things settle out from the first failover with the last rety here:

Jun 10 00:43:42 tekvm4 vmkernel: 1:12:23:32.462 cpu0:1044)SCSI: 8074: vmhba2:4:4:0 Retry (unit attn)

Next the whole sequence is repeated for the failover from sp2 back to sp1

Jun 10 01:21:45 tekvm4 vmkernel: 1:13:01:35.483 cpu0:1092)VSCSI: 1878: Reset request on handle 8195 (1 outstanding commands)
Jun 10 01:21:45 tekvm4 vmkernel: 1:13:01:35.483 cpu0:1092)VSCSI: 1878: Reset request on handle 8196 (0 outstanding commands)
Jun 10 01:21:45 tekvm4 vmkernel: 1:13:01:35.483 cpu5:1049)VSCSI: 2084: Resetting handle 8195 \[0/0]
Jun 10 01:21:45 tekvm4 vmkernel: 1:13:01:35.483 cpu5:1049)SCSI: 3317: handle 3939 / orig 0x3cf711e0
Jun 10 01:21:45 tekvm4 vmkernel: 1:13:01:35.483 cpu5:1049)<6>qla24xx_abort_command(1): handle to abort=1715

This settles out here

Jun 10 01:23:45 tekvm4 vmkernel: 1:13:03:35.149 cpu3:1049)VSCSI: 1927: Completing reset on handle 8221 (0 outstanding commands)

Here begins the failover of the second EVA:

Jun 10 02:10:01 tekvm4 vmkernel: 1:13:49:51.529 cpu1:1120)WARNING: SCSI: 5444: READ of handleID 0xf63
Jun 10 02:10:01 tekvm4 vmkernel: 1:13:49:51.529 cpu1:1286)SCSI: 8043: vmhba2:4:2:1 status = 0/3 0x0 0x0 0x0
Jun 10 02:10:01 tekvm4 vmkernel: 1:13:49:51.529 cpu1:1286)SCSI: 8131: vmhba2:4:2:1 Retry (abort after timeout)
Jun 10 02:10:01 tekvm4 vmkernel: 1:13:49:51.529 cpu1:1286)SCSI: 3191: vmhba2:4:2:1 Abort cmd due to timeout, s/n=270438, attempt 1
Jun 10 02:10:01 tekvm4 vmkernel: 1:13:49:51.529 cpu1:1286)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x5203430, originSN 270438 from vmhba2:4:2
Jun 10 02:10:01 tekvm4 vmkernel: 1:13:49:51.529 cpu1:1286)<6>qla24xx_abort_command(1): handle to abort=1016

Jun 10 02:10:01 tekvm4 vmkernel: 1:13:49:51.530 cpu1:1286)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jun 10 02:10:01 tekvm4 vmkernel: 1:13:49:51.530 cpu1:1286)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jun 10 02:10:01 tekvm4 vmkernel: 1:13:49:51.530 cpu1:1286)SCSI: 3204: vmhba2:4:2:1 Abort cmd on timeout succeeded, s/n=270438, attempt 1

Finally settling out here

Jun 10 03:19:42 tekvm4 vmkernel: 1:14:59:31.558 cpu4:1044)SCSI: 8074: vmhba2:4:4:0 Retry (unit attn)

Here are some abnormalities in various sytem logs seen after the upgrade. This are assumed to be false critical errors caused by data changes to the LUN or Volume information by the firmware upgrade. ESX sees this as possible disk changes or corruption.

Here's tekvm8 /var/log/vmkernel after a esxcfg-rescan vmhba1 showing the critical error after rescanning.

Jun 11 12:26:31 tekvm8 vmkernel: 11:20:23:27.893 cpu3:1037)SCSI: 8266: Starting rescan of adapter vmhba1

Jun 11 12:26:31 tekvm8 vmkernel: 11:20:23:27.893 cpu0:1024)VMNIX: VmkDev: 2478: tag disk remove: vmhba1:0:0

Jun 11 12:26:31 tekvm8 vmkernel: 11:20:23:27.893 cpu3:1037)SCSI: 1545: Device vmhba1:1:0 has not been identified as being attached to an active/passive SAN. It is either attached to an active/active SAN or is a local device.

Jun 11 12:26:31 tekvm8 vmkernel: 11:20:23:27.893 cpu3:1037)SCSI: 1551: Device vmhba1:1:0 user configured as having a non constant serial number in page 0x80

Jun 11 12:26:31 tekvm8 vmkernel: Vendor: HP Model: HSV210 Rev: 6100

Jun 11 12:26:31 tekvm8 vmkernel: Type: Unknown ANSI SCSI revision: 05

Jun 11 12:26:31 tekvm8 vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba1:0:0 : 0x0 0x1 0x80 0x81 0x83 0x87 0x86 0xd0 0x0

Jun 11 12:26:31 tekvm8 vmkernel: VMWARE SCSI Id: Device id info for vmhba1:0:0: 0x1 0x3 0x0 0x8 0x50 0x0 0x1f 0xe1 0x50 0xb 0xe0 0x80 0x1 0x14 0x0 0x4 0x0 0x0 0x0 0x1

Jun 11 12:26:31 tekvm8 vmkernel: VMWARE SCSI Id: Id for vmhba1:0:0 0x50 0x00 0x1f 0xe1 0x50 0x0b 0xe0 0x80 0x48 0x53 0x56 0x32 0x31 0x30 Jun 11 12:26:31 tekvm8 vmkernel: 11:20:23:27.894 cpu3:1037)WARNING: LinSCSI: 4564: The physical media represented by vmhba1:0:0 has changed. The device has been re-synchronized with the system.

Jun 11 12:26:31 tekvm8 vmkernel: 11:20:23:27.894 cpu3:1037)WARNING: LinSCSI: 4570: The physical media represented by vmhba1:0:1 has changed and the device is in use. The device cannot be re-synchronized with the system. This is a critical error.

Jun 11 12:26:31 tekvm8 vmkernel: 11:20:23:27.894 cpu3:1037)WARNING: LinSCSI: 4570: The physical media represented by vmhba1:0:2 has changed and the device is in use. The device cannot be re-synchronized with the system. This is a critical error.

Jun 11 12:26:31 tekvm8 vmkernel: 11:20:23:27.894 cpu3:1037)WARNING: LinSCSI: 4570: The physical media represented by vmhba1:0:3 has changed and the device is in use. The device cannot be re-synchronized with the system. This is a critical error.

After vmotioning all vms off of the system, here's what a rescan looks like that actually resynchronizes the LUN or volume info correctly:

kvm8 vmkernel: Vendor: HP Model: HSV210 Rev: 6100

Jun 11 12:26:39 tekvm8 vmkernel: Type: Unknown ANSI SCSI revision: 05

Jun 11 12:26:39 tekvm8 vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba2:7:0 : 0x0 0x1 0x80 0x81 0x83 0x87 0x86 0xd0 0x0

Jun 11 12:26:39 tekvm8 vmkernel: VMWARE SCSI Id: Device id info for vmhba2:7:0: 0x1 0x3 0x0 0x8 0x50 0x0 0x1f 0xe1 0x50 0x6 0xb2 0x30 0x1 0x14 0x0 0x4 0x0 0x0 0x0 0x8

Jun 11 12:26:39 tekvm8 vmkernel: VMWARE SCSI Id: Id for vmhba2:7:0 0x50 0x00 0x1f 0xe1 0x50 0x06 0xb2 0x30 0x48 0x53 0x56 0x32 0x31 0x30

Jun 11 12:26:39 tekvm8 vmkernel: 11:20:23:36.172 cpu1:1038)WARNING: LinSCSI: 4564: The physical media represented by vmhba2:7:0 has changed. The device has been re-synchronized with the system.

Finally after another rescan the info shows normally as this:

Jun 11 14:16:47 tekvm8 vmkernel: 0:00:00:50.185 cpu4:1038)LinSCSI: 689: Queue depth for device vmhba2:7:0 is 30

Jun 11 14:16:47 tekvm8 vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba2:7:0 : 0x0 0x1 0x80 0x81 0x83 0x87 0x86 0xd0 0x0

Jun 11 14:16:47 tekvm8 vmkernel: VMWARE SCSI Id: Device id info for vmhba2:7:0: 0x1 0x3 0x0 0x8 0x50 0x0 0x1f 0xe1 0x50 0x6 0xb2 0x30 0x1 0x14 0x0 0x4 0x0 0x0 0x0 0x8

Jun 11 14:16:47 tekvm8 vmkernel: VMWARE SCSI Id: Id for vmhba2:7:0 0x50 0x00 0x1f 0xe1 0x50 0x06 0xb2 0x30 0x48 0x53 0x56 0x32 0x31 0x30

So the fix for this "critical error" is to

Vmotion all running vms off the host.

Put host in maintenance mode

esxcfg-rescan vmhba1

esxcfg-rescan vmhba2

esxcfg-rescan vmhba1

esxcfg-rescan vmhba2

Exit maintenance mode

-=-=-=-=-=-=-=-=-=-=-=-=-=-=- http://blog.mr-vm.com http://www.vmprofessional.com -=-=-=-=-=-=-=-=-=-=-=-=-=-=-

All

Critical Error after HP EVA8000 upgrade to XCS 6.100