Hi guys,
I have a cluster of two ESXi servers running build 1065491. For a month, monitoring has been reporting issues with some VMs: pings are being lost.
I then found that the VMs losing pings were all hosted on the same ESXi server. Today an issue was reported with the vCenter appliance, which stopped responding. After more research I found that all VMs hosted on that particular host, both Windows and Linux, were showing these messages:
source: LSI_SAS
Event ID: 129
Reset to device, \Device\RaidPort0, was issued.
source: disk
Event ID: 153
The IO operation at logical block address 142f8d for Disk 0 was retried.
kernel: [40114.926402] mptscsih: ioc0: attempting task abort! (sc=ffff8802116c3d80)
kernel: [40114.926410] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 05 57 80 00 00 40 00
kernel: [40115.055129] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8802116c3d80) (sn=529576)
Any idea what could be causing this issue, or how to fix it?
Important: I moved one Windows 2012 VM from the non-problematic ESXi server to the problematic one, and the LSI_SAS and disk messages started to show up on it.
Thanks a lot, guys.
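For anyone following along, here is a minimal sketch for counting how often a Linux guest logged these aborts, so the problem host can be compared with the healthy one. It assumes the mptscsih message wording quoted above; count_task_aborts is a made-up helper name, not a standard tool.

```shell
# Count mptscsih task-abort attempts in kernel log text read from stdin.
# Assumption: the log uses the "attempting task abort" wording quoted above.
# count_task_aborts is a hypothetical helper name.
count_task_aborts() {
    grep -c 'mptscsih: .*attempting task abort'
}

# Usage inside a Linux guest (the log source may differ by distribution):
#   dmesg | count_task_aborts
```

A count that keeps growing only on guests running on one host points at that host or its storage path rather than at the guests themselves.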
Hi,
I understand that the LSI_SAS and mptscsih messages are coming from inside the guests, which are configured with an LSI controller. I agree, up to a point, that the LSI vmhba0 in the server might have nothing to do with the error messages inside the guests, since your VMs are not on a local datastore but on FC storage.
So, let us take a look at the usual suspects:
1. As Memmad pointed out, the vmkernel log will show whether the FC LUN is having any trouble.
2. Does the ESXi host itself respond very slowly when running commands like esxcfg-scsidevs -m or esxcfg-rescan vmhbaX (where X is the adapter number of your FC HBA)?
3. Is the FC LUN, which is configured for multipathing, facing any path issues? Again, this can be seen in the vmkernel logs.
4. I am sure the swap file is created on the FC LUN, but do check that the most recently updated swap file in the VM directory has a current timestamp. This is to rule out any delay in swap file writes from the guest due to FC issues.
5. There were earlier issues similar to yours, such as VMs losing pings and slower performance, but they were on iSCSI and on 4.1 (KB here).
It does seem that the guests are aborting or resetting their SCSI commands, which appears to be the issue here.
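A rough way to check point 2 is to time the same command on both hosts and compare. This is only a sketch: time_cmd is a hypothetical helper, it measures whole seconds only, and it assumes date +%s works in the shell you run it from.

```shell
# Run a command, discard its output, and print the elapsed whole seconds.
# time_cmd is a hypothetical helper name; resolution is one second.
time_cmd() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage on each ESXi host, then compare the two numbers:
#   time_cmd esxcfg-scsidevs -m
```

A large gap between the two hosts for the same command suggests the slow host is waiting on storage, which fits the aborts seen inside the guests.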
It looks like an issue with the LSI controller hardware being used. Can you involve your hardware vendor to run a sanity test on it? Also ensure the server BIOS, firmware, and drivers are up to date.
This server is not using local storage; in fact it has no disks and boots from storage. Could it be an issue with the Fibre Channel card?
The firmware is just one version behind.
How am I supposed to update drivers when ESXi installs them all at install time?
Can you confirm which controller you are using, or post the output of # lspci?
Interesting.
Like I said, this server has no internal disks, but it looks like the controller is there:
00:0c:00.0 Mass storage controller: LSI Logic / Symbios Logic LSI2004 [vmhba0]
So even when it is not used for internal disks, could it be affecting the server? Or is it the Fibre Channel HBAs?
00:11:00.0 Serial bus controller: QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA [vmhba1]
00:1b:00.0 Serial bus controller: QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA [vmhba3]
Thanks.
BTW, in about 4 hours I am going to update all the firmware on the server, but any input will be appreciated.
The aborts are generated for vmhba0, which is the LSI controller, as stated earlier.
Sorry to keep asking, but how do you know they are generated for vmhba0? I have not posted any logs yet.
Hi,
Can you post the vmkernel log file from the affected ESXi host?
Regards
Mohammed
Thanks a lot, guys, for your input.
OK, I found this:
1. No logs. I was trying to get the vmkernel log for you guys.
2013-07-09_0943 - karlochacon's library
2. This command takes noticeably longer than on the other ESXi server, which works normally:
# esxcfg-scsidevs -m
Right now I am rebooting the server to update its firmware.
"1. no logs. I was trying to get vmkernel for you guys"
Hmm, that might be because a scratch partition was not set.
"This command is taking some time compare to the other esxi server which works normal"
Let's hope it fails with an error message.
Just in case, you can watch the vmkernel activity by pressing Alt+F12 on the console of the ESXi server while running esxcfg-rescan vmhbaX.
Alt+F12 shows this:
I was able to find this KB from the clues in the screenshot. Hope that helps.
Thanks a lot.
Yeah, I was thinking about this too, even though this ESXi host has not hung yet.
# esxcli system settings kernel list -o iovDisableIR
In fact, after using this:
# esxcli system settings kernel set --setting=iovDisableIR -v TRUE
this command is working as it should:
# esxcfg-scsidevs -m
Now I am going to add some workload to this server, monitor it, and get back to you guys.
Great.
Also, the message
vmklnx iodm event vmhba1 frame dropped 206 times in 60s
is kind of interesting to dig into.
Happy to help,
zXi
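To see whether those iodm messages keep pointing at the same adapter, something like the sketch below could be run over saved log text. It assumes the message wording quoted above (the real log line may be formatted differently), and iodm_lines_per_adapter is a hypothetical helper name.

```shell
# Count log lines that mention "iodm", grouped by the first vmhbaN name on the line.
# Assumptions: the log wording matches the message quoted above, and
# iodm_lines_per_adapter is a hypothetical helper name.
iodm_lines_per_adapter() {
    awk 'tolower($0) ~ /iodm/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^vmhba[0-9]+/) {
                name = $i
                sub(/[^a-z0-9].*$/, "", name)   # strip trailing punctuation such as ":"
                count[name]++
                break
            }
        }
    }
    END { for (a in count) print a, count[a] }' | sort
}

# Usage, with log text on stdin (the log path is up to you):
#   iodm_lines_per_adapter < saved-vmkernel.log
```

Note that this counts matching log lines per adapter, not the number of dropped frames reported inside each message.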
Yeah, I could not find anything about that message. The only thing that caught my attention was the reference to vmhba1.
Did you ever figure this out?
Hi, I am seeing the same problem right now (2019) on ESXi 6.5.
I cannot quite follow what was wrong in this case.
Hi!
I have the same issue here. Any way to fix it?