escapem2
Enthusiast

IO issues VMs in one host

Hi guys,

I have a cluster of 2 ESXi servers running build 1065491. For a month, monitoring has been reporting issues with some VMs: pings are being lost.

I then found that the VMs losing pings were all hosted on the same ESXi server. Today an issue was reported with the vCenter appliance, which did not respond. Doing more research, I found that all VMs hosted on that particular host, both Windows and Linux, were showing these messages:

source: LSI_SAS

Event ID: 129

Reset to device, \Device\RaidPort0, was issued.

source: disk

Event ID: 153

The IO operation at logical block address 142f8d for Disk 0 was retried.

kernel: [40114.926402] mptscsih: ioc0: attempting task abort! (sc=ffff8802116c3d80)

kernel: [40114.926410] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 05 57 80 00 00 40 00

kernel: [40115.055129] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8802116c3d80) (sn=529576)

Any idea what could be causing this issue, or how to fix it?

Important: I moved one Windows 2012 VM from the non-problematic ESXi server to the problematic one, and the LSI_SAS and disk messages started to show up on it.

Thanks a lot, guys.
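
To make the Linux kernel line above easier to read, here is a minimal sketch that decodes the Read(10) CDB quoted in the log. It assumes the standard READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8). The function name decode_read10 is made up for illustration.

```python
def decode_read10(cdb_hex):
    """Decode a 10-byte SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        "opcode": cdb[0],
        # Bytes 2..5: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8: number of blocks to transfer, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


# CDB copied from the mptscsih task-abort message in the post.
info = decode_read10("28 00 00 05 57 80 00 00 40 00")
print(info)  # {'opcode': 40, 'lba': 350080, 'blocks': 64}
```

So the aborted command was an ordinary 64-block read at LBA 350080, which suggests the guest timed out on normal I/O rather than on anything exotic.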


Accepted Solutions
zXi_Gamer
Virtuoso

Hi,

     I understand that the LSI_SAS and mptscsih messages appear inside the guests, which have a virtual LSI controller. I agree, to a point, that the LSI vmhba0 in the server might have nothing to do with the error messages inside the guests, since your VMs are not on a local datastore but on FC storage.

So, let us take a look at the usual suspects:

1. As memaad pointed out, the vmkernel log would show whether the FC LUN is having any trouble.

2. Does the ESXi host itself respond very slowly when running commands like esxcfg-scsidevs -m or esxcfg-rescan vmhbaX [X adapter of your FC]?

3. Is the FC LUN configured for multipathing, and is it facing any path issues? Again, this can be seen in the vmkernel logs.

4. I am sure the swap file is created on the FC LUN, but do check whether the swap file in the VM directory has a current last-modified time. This is to rule out delayed swap file writes caused by FC issues.

5. There were earlier issues similar to yours (VMs losing pings, slower performance), but they were on iSCSI and in 4.1; kb here

It does seem that the guests are aborting or resetting their SCSI commands, which appears to be the issue here.
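
The swap file check in point 4 could be scripted roughly like this. It is only a sketch: the function name stale_swap_files and the 300-second threshold are assumptions for illustration, and the directory you pass in would be your own VM folder on the datastore.

```python
import time
from pathlib import Path


def stale_swap_files(vm_dir, max_age_s=300.0, now=None):
    """Return (file name, age in seconds) for each *.vswp older than max_age_s."""
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(vm_dir).glob("*.vswp")):
        # Age is measured from the file's last-modified timestamp.
        age = now - path.stat().st_mtime
        if age > max_age_s:
            stale.append((path.name, age))
    return stale
```

An empty list means every swap file in that directory was modified within the threshold. A non-empty list is only a hint to look closer at the FC path, not proof of a storage fault.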

18 Replies
admin
Immortal

It looks like an issue with the LSI controller hardware in use. Can you involve your hardware vendor to run a sanity test on it? Also ensure the server BIOS, firmware, and drivers are up to date.

escapem2
Enthusiast

This server is not using local storage; in fact it has no disks and boots from storage. Could it be an issue with the Fibre Channel card?

The firmware is just one version behind...

How am I supposed to update drivers when ESXi installs them all at install time?


Can you confirm which controller you are using, or post the output of # lspci?

escapem2
Enthusiast

interesting

Like I said, this server has no internal disks, but it looks like a controller is there:

00:0c:00.0 Mass storage controller: LSI Logic / Symbios Logic LSI2004 [vmhba0]

So even though it is not used for internal disks, could it be affecting the server? Or is it the Fibre Channel HBAs?

00:11:00.0 Serial bus controller: QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA [vmhba1]

00:1b:00.0 Serial bus controller: QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA [vmhba3]

thanks

BTW, in about 4 hours I am going to update all the firmware on the server, but any input will be appreciated.

admin
Immortal

The aborts generated are for vmhba0 which is the LSI controller as stated earlier...

escapem2
Enthusiast

Sorry to keep asking, but how do you know they are generated for vmhba0? I have not posted any logs yet.

memaad
Commander

Hi,

Can you post the vmkernel log file from the affected ESXi host?

Regards

Mohammed

Mohammed Emaad |VCP 3, 4,5 |VCP -NV 6 | VCP-DT 51 | vCAP4-DCA | VCAP5DCA | | Mark it as helpful or correct if my suggestion is useful.
escapem2
Enthusiast

Thanks a lot, guys, for your input.

OK, I found this:

1. No logs. I was trying to get the vmkernel log for you guys.

2013-07-09_0943 - karlochacon's library

2. This command takes noticeably longer than on the other ESXi server, which works normally:

# esxcfg-scsidevs -m


Right now I am rebooting the server to update its firmware.
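
To put a number on "takes noticeably longer", the same command could be timed on both hosts. A minimal sketch, assuming a Python interpreter is available wherever you run it; the helper name time_command and the 120-second timeout are made up for illustration.

```python
import subprocess
import time


def time_command(argv, timeout_s=120.0):
    """Run argv once, discard its output, and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    proc = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired if the command hangs
    )
    return proc.returncode, time.monotonic() - start
```

Calling time_command(["esxcfg-scsidevs", "-m"]) on the healthy host and on the problematic one gives two figures that can be compared directly, before and after any change.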

zXi_Gamer
Virtuoso

"1. no logs. I was trying to get vmkernel for you guys"

Hmm, that might be because a scratch partition was not set.

"This command is taking some time compare to the other esxi server which works normal"

Let's hope it fails with an error message.

Just in case, you can watch the vmkernel activity by pressing Alt+F12 on the console of the ESXi server while running esxcfg-rescan vmhbaX

escapem2
Enthusiast
zXi_Gamer
Virtuoso

I was able to find this kb from the clues in the screenshot :) Hope that helps.

escapem2
Enthusiast

Thanks a lot.

Yeah, I was thinking about this too, even though this ESXi host has not hung yet:

esxcli system settings kernel list -o iovDisableIR

escapem2
Enthusiast

In fact, after running this:

# esxcli system settings kernel set --setting=iovDisableIR -v TRUE

this command works as it should:

# esxcfg-scsidevs -m

:)

Now I am going to add some workload to this server, monitor it, and get back to you guys.

zXi_Gamer
Virtuoso

Great...

Also, the message

vmklnx iodm event vmhba1 frame dropped 206 times in 60s

is kind of interesting to dig into :)

Happy to help,

zXi
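
For anyone who wants to track that frame-drop message over time, here is a small sketch that pulls the adapter and the drop rate out of it. The regular expression only matches the wording quoted in this post; the exact format of the line in a real vmkernel log may differ, and the name frame_drop_rate is made up for illustration.

```python
import re

# Matches the message text as quoted in the thread, not a verified log format.
PATTERN = re.compile(r"(vmhba\d+) frame dropped (\d+) times in (\d+)s")


def frame_drop_rate(line):
    """Return (adapter, drops_per_second), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    adapter = match.group(1)
    count = int(match.group(2))
    window_s = int(match.group(3))
    return adapter, count / window_s


adapter, rate = frame_drop_rate(
    "vmklnx iodm event vmhba1 frame dropped 206 times in 60s"
)
print(adapter, round(rate, 2))  # vmhba1 3.43
```

That works out to a bit more than three dropped frames per second on vmhba1, one of the two QLogic FC HBAs in the lspci output earlier in the thread.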

escapem2
Enthusiast

Yeah, I could not find anything about that message. The only thing that caught my attention was the reference to vmhba1.

magicman1223
Contributor

Did you ever figure this out?

wilga1995
Contributor

Hi, I am seeing the same problem right now (2019) in ESXi 6.5.

I cannot quite follow what was wrong in this case.
