js_opdebeeck
Enthusiast

ESX AsyncIO ...

Hello

I'm running ESX 3.0.1 and I'm having trouble with a storage device. The result: one of my VM hosts hung and crashed, and its virtual disk was left inconsistent.

ESX runs on a Dell PE with a SAS disk bay. There is no physical problem with the disks or the connections.

I've opened a ticket with my Dell Gold support, but they have no idea how to help us.

If you have any idea what causes this and how I can avoid errors like this in the future, please let me know.

Some info from vmkernel log:

Sept 22: no hang, but it may help to understand the log.

Sep 22 14:33:04 esxdev vmkernel: 242:09:27:22.353 cpu1:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 349347, handle 748fc/0x7202cd0

Sep 22 14:33:04 esxdev vmkernel: 242:09:27:22.353 cpu1:1028)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x7202cd0, originSN 349347 from vmhba1:3:0

Sep 22 14:33:04 esxdev vmkernel: 242:09:27:22.353 cpu1:1028)LinSCSI: 3612: Abort failed for cmd with serial=349347, status=bad0001, retval=bad0001

Sept 29: the problem itself.

Sep 29 14:25:56 esxdev vmkernel: 249:09:20:14.808 cpu1:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 550848, handle 748fc/0x7202cd0

Sep 29 14:25:56 esxdev vmkernel: 249:09:20:14.808 cpu1:1028)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x7202cd0, originSN 550848 from vmhba1:3:0

Sep 29 14:25:56 esxdev vmkernel: 249:09:20:14.808 cpu1:1028)LinSCSI: 3612: Abort failed for cmd with serial=550848, status=bad0001, retval=bad0001

Sep 29 14:26:01 esxdev vmkernel: 249:09:20:19.809 cpu1:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 550848, handle 748fc/0x7202cd0

Sep 29 14:26:01 esxdev vmkernel: 249:09:20:19.809 cpu1:1028)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x7202cd0, originSN 550848 from vmhba1:3:0

Sep 29 14:26:01 esxdev vmkernel: 249:09:20:19.809 cpu1:1028)LinSCSI: 3612: Abort failed for cmd with serial=550848, status=bad0001, retval=bad0001

Sep 29 14:26:06 esxdev vmkernel: 249:09:20:24.811 cpu1:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 550848, handle 748fc/0x7202cd0

Sep 29 14:26:06 esxdev vmkernel: 249:09:20:24.811 cpu1:1028)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x7202cd0, originSN 550848 from vmhba1:3:0

Sep 29 14:26:06 esxdev vmkernel: 249:09:20:24.811 cpu1:1028)LinSCSI: 3612: Abort failed for cmd with serial=550848, status=bad0001, retval=bad0001

Sep 29 14:26:11 esxdev vmkernel: 249:09:20:29.811 cpu1:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 550848, handle 748fc/0x7202cd0

Sep 29 14:26:11 esxdev vmkernel: 249:09:20:29.811 cpu1:1028)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x7202cd0, originSN 550848 from vmhba1:3:0

Sep 29 14:26:11 esxdev vmkernel: 249:09:20:29.811 cpu1:1028)LinSCSI: 3612: Abort failed for cmd with serial=550848, status=bad0001, retval=bad0001

Sep 29 14:26:16 esxdev vmkernel: 249:09:20:34.812 cpu1:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 550848, handle 748fc/0x7202cd0

Sep 29 14:26:16 esxdev vmkernel: 249:09:20:34.812 cpu1:1028)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x7202cd0, originSN 550848 from vmhba1:3:0

Sep 29 14:26:16 esxdev vmkernel: 249:09:20:34.812 cpu1:1028)LinSCSI: 3612: Abort failed for cmd with serial=550848, status=bad0001, retval=bad0001

Sep 29 14:26:21 esxdev vmkernel: 249:09:20:39.814 cpu1:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 550848, handle 748fc/0x7202cd0

Sep 29 14:26:21 esxdev vmkernel: 249:09:20:39.814 cpu1:1028)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x7202cd0, originSN 550848 from vmhba1:3:0

Sep 29 14:26:21 esxdev vmkernel: 249:09:20:39.814 cpu1:1028)LinSCSI: 3612: Abort failed for cmd with serial=550848, status=bad0001, retval=bad0001

Sep 29 14:26:26 esxdev vmkernel: 249:09:20:44.816 cpu1:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 550848, handle 748fc/0x7202cd0

Sep 29 14:26:26 esxdev vmkernel: 249:09:20:44.816 cpu1:1028)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x7202cd0, originSN 550848 from vmhba1:3:0

Sep 29 14:26:26 esxdev vmkernel: 249:09:20:44.816 cpu1:1028)LinSCSI: 3612: Abort failed for cmd with serial=550848, status=bad0001, retval=bad0001

Sep 29 14:26:31 esxdev vmkernel: 249:09:20:49.226 cpu0:1036)FS3: 4052: Reclaimed timed out heartbeat [HB state abcdef02 offset 3942912 gen 158 stamp 21547209727819 uuid 45b40e18-11d1b6ae-2ddf-00188b7adc6c jrnl ]
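
To make a log like the one above easier to read, here is a minimal Python sketch that counts the AsyncIO timeouts and failed aborts per command serial number. The regular expressions are assumptions based only on the lines quoted in this post, and the function name `summarize` is hypothetical; other ESX builds may format these messages differently.

```python
import re
from collections import defaultdict

# Assumed pattern for the "AsyncIO timeout" lines quoted in this thread.
TIMEOUT_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+vmkernel:.*"
    r"AsyncIO timeout \((?P<ms>\d+)\); aborting cmd w/ sn (?P<sn>\d+)"
)
# Assumed pattern for the "Abort failed" lines quoted in this thread.
ABORT_FAILED_RE = re.compile(r"Abort failed for cmd with serial=(?P<sn>\d+)")


def summarize(lines):
    """Group timeouts and failed aborts by command serial number."""
    stats = defaultdict(
        lambda: {"timeouts": 0, "failed_aborts": 0, "first": None, "last": None}
    )
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            entry = stats[m.group("sn")]
            entry["timeouts"] += 1
            # Remember the first and the most recent syslog timestamp seen.
            entry["first"] = entry["first"] or m.group("stamp")
            entry["last"] = m.group("stamp")
            continue
        m = ABORT_FAILED_RE.search(line)
        if m:
            stats[m.group("sn")]["failed_aborts"] += 1
    return dict(stats)


if __name__ == "__main__":
    import sys

    # Usage: python summarize_aborts.py < vmkernel.log
    for sn, entry in summarize(sys.stdin).items():
        print(sn, entry)
```

A serial number that shows up again and again, as 550848 does every five seconds on Sep 29, points to a single command that is repeatedly timed out and never successfully aborted, rather than to many independent failures.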

Thanks a lot.

Js Op de Beeck

3 Replies
Texiwill
Leadership

Hello,

On this storage box, do you have more than one host accessing the same LUN? Does the storage box have independent controllers, or does it just use the PERC card(s) in the hosts?

If the same LUN is shared by multiple hosts and there are no independent controllers within the disk tray, then you have a SCSI reservation problem. (See the HP MSA500 or the Dell equivalent for what I mean by independent controllers.)

If this is a standalone box, then do you have multiple connections to the disk tray?

Generally, if it is a standalone box, this indicates a hardware issue somewhere.

Could you explain more about how the device is set up?

Best regards,

Edward

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill
js_opdebeeck
Enthusiast

Hello,

It's a local SAS box, and luckily only one host has virtual disks on it. The uptime is 153 days, with no issue with it.

  • Currently Dell suggests:

    • migrating to ESX 3.2

    • upgrading all firmware ... not helpful.

My goal is to understand why this problem happens, not to move to another configuration (I'm on a production server).

Js

dominic7
Virtuoso

I'm still a bit fuzzy on your hardware configuration. You have a single host with a PERC/E connected to an external SAS disk shelf? If so, is it correct that this is the only host accessing the disk shelf?

Assuming all of the above is true, it's most likely a hardware problem. You will want to upgrade the firmware on the PERC 5/E, as earlier versions of the firmware were known to cause issues with ESX.
