For two days in a row, there was a two-second period where a VM using pvscsi produced messages in the log like the following:
Jun 24 00:14:43 don kernel: sd 1:0:0:0: SCSI error: return code = 0x00070000
Jun 24 00:14:43 don kernel: end_request: I/O error, dev sdc, sector 21394879
Jun 24 00:14:43 don kernel: printk: 3476 messages suppressed.
Jun 24 00:14:43 don kernel: Buffer I/O error on device sdc1, logical block 2674352
Jun 24 00:14:43 don kernel: lost page write due to I/O error on sdc1
Then it marked the device read-only.
I don't see any errors in any log (either vmware.log or in /var/log) that correspond to those times.
What should the next step of debugging this be? It seems to happen about every few days.
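For reference, the `return code = 0x00070000` in the messages above can be unpacked by hand. This is a minimal sketch, assuming the classic Linux SCSI mid-layer layout of the result word (driver byte, host byte, message byte, status byte, from high to low) and the host-byte names from the kernel's scsi.h. The function name `decode_scsi_result` is made up for illustration.

```python
# Host-byte codes as named in the Linux kernel's include/scsi/scsi.h.
HOST_BYTES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four bytes (hypothetical helper)."""
    host = (result >> 16) & 0xFF
    return {
        "driver_byte": (result >> 24) & 0xFF,
        "host_byte": host,
        "host_name": HOST_BYTES.get(host, "unknown"),
        "msg_byte": (result >> 8) & 0xFF,
        "status_byte": result & 0xFF,
    }


if __name__ == "__main__":
    print(decode_scsi_result(0x00070000))
```

For 0x00070000 the host byte is 0x07 (DID_ERROR) and the status byte is zero, which suggests the failure was reported by the host adapter driver rather than as a SCSI status returned by the disk itself.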
Hello,
Is the PVSCSI device on a VMDK or RDM? Is the storage Local or Remote (SAN, iSCSI, NFS)? If you switch to the LSIlogic driver do you get similar issues? Have you checked the storage device for any errors?
Best regards,
Edward L. Haletky, VMware Communities User Moderator, VMware vExpert 2009, Virtualization Practice Analyst
Is the PVSCSI device on a VMDK or RDM?
VMDK
Is the storage Local or Remote (SAN, iSCSI, NFS)?
iSCSI
If you switch to the LSIlogic driver do you get similar issues?
No, they go away and it's back to working again.
Have you checked the storage device for any errors?
Nothing in the device's log, and I'm having trouble correlating any messages in /var/log/vmkernel with these events.
Have you managed to progress with this issue?
We are running ESXi 4 Update 1 (219382) and have just started running into the same issue: we get SCSI errors that seem to occur somewhat randomly.
We are running on a SAN with an EVA 8000 on the backend. We did notice the SAN experiencing a performance issue at the time, but it was not a dramatic one.
Hi kenner/rabie,
Could you please upgrade your guest OS to the latest VMware Tools, sync its time to the ESX host, and then rerun it for us?
If you hit it again, please post the vm-support output and your guest OS kernel log. Thanks very much.
binoche, VMware VCP, Cisco CCNA
I have found more Linux VMs that have been affected across my farm.
We already run ESXi 4.0 Update 1 (219382) and RHEL 5.4 2.6.18-164.11.1 (all the latest). On my guests the SCSI timeout has been set to 180:
root@guest1 ~# cat /sys/block/sd*/device/timeout
180
180
180
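For anyone who wants to collect the same values across many guests, here is a minimal sketch that reads the same sysfs files as the `cat` command above. The function name `read_scsi_timeouts` and its `sys_block` parameter are made up for illustration; the parameter exists only so the function can be pointed at a different directory.

```python
import glob
import os


def read_scsi_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_seconds} for every sd* block device."""
    timeouts = {}
    pattern = os.path.join(sys_block, "sd*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        # path looks like <sys_block>/sdX/device/timeout; recover "sdX".
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as fh:
            timeouts[device] = int(fh.read().strip())
    return timeouts
```

Calling `read_scsi_timeouts()` on a guest returns one entry per sd* disk, which makes it easy to spot a disk whose timeout was not raised.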
What I have noticed is that at the time we had issues, I had odd pathing issues to my SAN storage (but only to the specific LUN the VM was on). However, only Linux VMs with PVSCSI were affected at the time; other Linux VMs with LSI controllers and Windows VMs with PVSCSI were fine.
If you ask me it's almost as if the PVSCSI module is ignoring the timeout setting.
Feb 10 08:43:00 host1 vmkernel: 20:00:11:16.174 cpu3:7486218)ScsiDeviceIO: 747: Command 0x2a to device "naa.600508b400106c8700027000148a0000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
174 cpu3:7486096)ScsiDeviceIO: 747: Command 0x2a to device "naa.600508b400106c8700027000148a0000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
Feb 10 08:43:00 host1 vmkernel: 20:00:11:16.174 cpu3:7486218)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100070592c0) to NMP device "naa.600508b400106c8700027000148a0000" failed on physical path "vmhba1:C0:T5:L15" H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
Which results in these errors on my VMs:
Feb 10 10:43:12 guest1 kernel: sd 0:0:1:0: SCSI error: return code = 0x00070000
Feb 10 10:43:12 guest1 kernel: end_request: I/O error, dev sdb, sector 19022342
I'll post the vm-support output if you really want me to.
Thanks, Rabie.
Let me check what the message below means:
Feb 10 08:43:00 host1 vmkernel: 20:00:11:16.174 cpu3:7486218)ScsiDeviceIO: 747: Command 0x2a to device "naa.600508b400106c8700027000148a0000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
174 cpu3:7486096)ScsiDeviceIO: 747: Command 0x2a to device "naa.600508b400106c8700027000148a0000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0
binoche, VMware VCP, Cisco CCNA
Hi rabie,
D:0x28 means "28h TASK SET FULL".
Here is a KB article on how to handle 0x28 on 3PAR; could you please check whether this workaround also works for you? Thanks.
binoche, VMware VCP, Cisco CCNA
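To make the status fields in the vmkernel lines above easier to read, here is a minimal sketch that pulls the opcode and the H:/D:/P: values out of a log line and names the device status. The status names follow the standard SCSI status codes. Reading H: as host status and P: as plugin status is my understanding of the ESX log format, not something confirmed in this thread. The helper name `parse_vmkernel_failure` and the regular expression are made up for illustration.

```python
import re

# Standard SCSI status codes (device status, the "D:" field).
SCSI_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
    0x30: "ACA ACTIVE",
    0x40: "TASK ABORTED",
}

# Only the two opcodes relevant here; anything else is reported as unknown.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

PATTERN = re.compile(
    r"Command (0x[0-9a-fA-F]+).*?"
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)


def parse_vmkernel_failure(line):
    """Return the decoded fields of a 'Command ... failed' line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    opcode, host, device, plugin = (int(x, 16) for x in match.groups())
    return {
        "opcode": opcode,
        "opcode_name": OPCODES.get(opcode, "unknown"),
        "host_status": host,
        "device_status": device,
        "device_status_name": SCSI_STATUS.get(device, "unknown"),
        "plugin_status": plugin,
    }
```

Applied to the ScsiDeviceIO line above, this reports a WRITE(10) command that came back with device status TASK SET FULL while the host and plugin status were both zero, i.e. the array refused the command because its queue was full.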
Thanks for the prompt and helpful response. I still have an issue with the fact that only Linux guests running PVSCSI have been affected by the queue-full condition and suffered data corruption.
I have read the KB; I need to find out how this affects me, as we have a couple of farms running anything from an EVA 4400, 5000, or 8000, and hopefully an EVA 8400 soon.
R
Are you using passthrough RDM or non-passthrough RDM?
As far as I know, a Linux OS with passthrough RDM frequently hits similar issues.
If you want to change the default configuration on a production system, please contact VMware Support first.
binoche, VMware VCP, Cisco CCNA
All VMDKs; we are trying to avoid using RDMs if at all possible (so far so good).
I have logged a support call with VMware. So far they have wanted to write it off as an OS vendor (RHEL) issue, but I recently sent them the ESX logs and am waiting for feedback from that perspective.
R
Please check with them whether the QFullThreshold KB is also OK for HP EVA arrays. Thanks.
binoche, VMware VCP, Cisco CCNA
We're having the same issue with two RHEL 5 64-bit VMs and the pvscsi driver. We are using ESXi 4.0.0, 244038 on the host side with EMC storage.
Any help would be greatly appreciated.
Did you ever get an answer from VMware on this issue? We are having similar issues and are being told that PVSCSI is not supported on RHEL 5 guests.