Re: SCSI errors on RHEL5-64 VM using PVSCSI

kenner · ‎06-25-2009

For two days in a row, there was a two-second period where a VM using pvscsi produced messages in the log like the following:

Jun 24 00:14:43 don kernel: sd 1:0:0:0: SCSI error: return code = 0x00070000

Jun 24 00:14:43 don kernel: end_request: I/O error, dev sdc, sector 21394879

Jun 24 00:14:43 don kernel: printk: 3476 messages suppressed.

Jun 24 00:14:43 don kernel: Buffer I/O error on device sdc1, logical block 2674352

Jun 24 00:14:43 don kernel: lost page write due to I/O error on sdc1

Then it marked the device read-only.

I don't see any errors in any log (either vmware.log or anything in /var/log) that correspond to those times.

What should the next step of debugging this be? It seems to happen about every few days.
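
As a starting point for debugging, the return code in the messages above can be split into its component bytes. The sketch below assumes the usual Linux SCSI mid-layer layout of the 32-bit result (SCSI status in bits 0-7, message byte in bits 8-15, host byte in bits 16-23, driver byte in bits 24-31) and the DID_* host-byte names from the kernel's SCSI headers. The function name decode_scsi_result is only illustrative.

```python
# Host-byte (DID_*) codes as named in the Linux kernel's SCSI headers.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_scsi_result(code):
    """Split a 32-bit Linux SCSI result into its four bytes."""
    host = (code >> 16) & 0xFF
    return {
        "status": code & 0xFF,          # SCSI status from the target
        "msg": (code >> 8) & 0xFF,      # message byte
        "host": host,                   # set by the host adapter driver
        "driver": (code >> 24) & 0xFF,  # set by the mid-layer
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
    }


if __name__ == "__main__":
    result = decode_scsi_result(0x00070000)
    print(result["host_name"], hex(result["host"]))
```

Under that layout, 0x00070000 has host byte 0x07 (DID_ERROR) and zero in the other three bytes, which suggests the failure was reported by the adapter driver in the guest rather than as a SCSI status returned by the disk.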

Texiwill · ‎07-01-2009

Hello,

Is the PVSCSI device on a VMDK or RDM? Is the storage Local or Remote (SAN, iSCSI, NFS)? If you switch to the LSIlogic driver do you get similar issues? Have you checked the storage device for any errors?

Best regards,

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

kenner · ‎07-01-2009

Is the PVSCSI device on a VMDK or RDM?

VMDK

Is the storage Local or Remote (SAN, iSCSI, NFS)?

iSCSI

If you switch to the LSIlogic driver do you get similar issues?

No, they go away and it's back to working again.

Have you checked the storage device for any errors?

Nothing in the device's log, and I'm having trouble correlating any messages in /var/log/vmkernel with these events.

Rabie · ‎02-10-2010

Have you managed to progress with this issue?

We are running ESXi 4 Update 1 (219382) and have just started running into the same issue: we get SCSI errors, and they seem to occur somewhat randomly.

We are running on a SAN with an EVA 8000 on the backend. We did notice the SAN experiencing a performance issue at the time, but it was not a dramatic one.

binoche · ‎02-10-2010

Hi, kenner/Rabie

Could you please upgrade your guest OS to the latest VMware Tools, sync its time to the ESX host, and then rerun it for us?

If you hit the issue again, please post the vm-support bundle and your guest OS kernel log. Thanks very much.

binoche, VMware VCP, Cisco CCNA

Rabie · ‎02-10-2010

I have found more Linux VMs that have been affected across my farm.

We already run ESXi 4.0 Update 1 (219382) and RHEL 5.4 with kernel 2.6.18-164.11.1 (all the latest). On my guests the SCSI timeout has been set to 180:

[root@guest1 ~]# cat /sys/block/sd*/device/timeout

180
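
For checking this across many guests, the same sysfs files can be read from a script. This is a minimal Python sketch, assuming the /sys/block/sd*/device/timeout layout used in the command above. The function name read_scsi_timeouts and its sys_block parameter are hypothetical, introduced only for illustration.

```python
import glob
import os


def read_scsi_timeouts(sys_block="/sys/block"):
    """Return {disk_name: timeout_in_seconds} for every sd* disk found."""
    timeouts = {}
    pattern = os.path.join(sys_block, "sd*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        # path looks like <sys_block>/sda/device/timeout
        disk = path.split(os.sep)[-3]
        with open(path) as fh:
            timeouts[disk] = int(fh.read().strip())
    return timeouts


if __name__ == "__main__":
    for disk, seconds in read_scsi_timeouts().items():
        print(disk, seconds)
```

This only reports the value the SCSI layer is configured with; it does not show whether a given driver honours it.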

What I have noticed is that, at the time we had the errors, I had odd pathing issues to my SAN storage (but only to the specific LUN the VM was on). However, only Linux VMs with PVSCSI were affected at the time; other Linux VMs with LSI controllers and Windows VMs with PVSCSI were fine.

If you ask me, it's almost as if the PVSCSI module is ignoring the timeout setting.

Feb 10 08:43:00 host1 vmkernel: 20:00:11:16.174 cpu3:7486218)ScsiDeviceIO: 747: Command 0x2a to device "naa.600508b400106c8700027000148a0000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.

[...].174 cpu3:7486096)ScsiDeviceIO: 747: Command 0x2a to device "naa.600508b400106c8700027000148a0000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.

Feb 10 08:43:00 host1 vmkernel: 20:00:11:16.174 cpu3:7486218)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100070592c0) to NMP device "naa.600508b400106c8700027000148a0000" failed on physical path "vmhba1:C0:T5:L15" H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.

This results in these errors on my VMs:

Feb 10 10:43:12 guest1 kernel: sd 0:0:1:0: SCSI error: return code = 0x00070000

Feb 10 10:43:12 guest1 kernel: end_request: I/O error, dev sdb, sector 19022342

I'll post the vm-support bundle if you really want me to.

binoche · ‎02-10-2010

Thanks, Rabie.

Let me check what the message below means:

Feb 10 08:43:00 host1 vmkernel: 20:00:11:16.174 cpu3:7486218)ScsiDeviceIO: 747: Command 0x2a to device "naa.600508b400106c8700027000148a0000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.

[...].174 cpu3:7486096)ScsiDeviceIO: 747: Command 0x2a to device "naa.600508b400106c8700027000148a0000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.

binoche · ‎02-10-2010

Hi, Rabie

D:0x28 is the SCSI device status "28h TASK SET FULL", i.e. the target reported that its command queue was full.

Here is a KB article on how to handle 0x28 on 3PAR arrays:

Controlling LUN queue depth throttling in VMware ESX for 3PAR storage arrays (1008113)

Could you please also check whether this workaround works for you? Thanks.
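
To find every such event in a vmkernel log, the H:/D:/P: fields can be pulled out with a short script. This is a rough Python sketch, not a VMware tool: it assumes the ScsiDeviceIO line format exactly as pasted in this thread (it does not match the NMP line, which has a different layout), and it reads H, D and P as host, device and plugin status. The function name parse_scsi_failure is hypothetical, and the status table lists only a few standard SCSI status codes.

```python
import re

# A few standard SCSI status codes (the D: field).
SCSI_STATUS_NAMES = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}

# Matches the ScsiDeviceIO failure lines quoted in this thread.
FAILURE_RE = re.compile(
    r'Command (0x[0-9a-fA-F]+) to device "([^"]+)" failed '
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)


def parse_scsi_failure(line):
    """Return the decoded fields of a failure line, or None if no match."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    opcode, device, host, dev_status, plugin = match.groups()
    dev_status = int(dev_status, 16)
    return {
        "opcode": int(opcode, 16),  # 0x2a is WRITE(10)
        "device": device,
        "host_status": int(host, 16),
        "device_status": dev_status,
        "plugin_status": int(plugin, 16),
        "device_status_name": SCSI_STATUS_NAMES.get(dev_status, "unknown"),
    }
```

Feeding it the first line quoted above gives device status 0x28 (TASK SET FULL) for a WRITE(10) command, with host and plugin status both zero.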

Rabie · ‎02-11-2010

Thanks for the prompt and helpful response. I still have an issue with the fact that only Linux guests running PVSCSI were affected by the queue-full condition and suffered data corruption.

I have read the KB. I need to find out how this affects me, as we have a couple of farms running anything from an EVA 4400, 5000 or 8000, and hopefully an EVA 8400 soon.

R

binoche · ‎02-11-2010

Are you using passthrough RDM or non-passthrough RDM?

As far as I know, a Linux guest with passthrough RDM frequently hits similar issues.

If you want to change the default configuration on a production system, please contact VMware Support first.

Rabie · ‎02-11-2010

All VMDKs. We are trying to avoid using RDMs if at all possible (so far so good).

I have logged a support call with VMware. So far they wanted to write it off as an OS vendor (RHEL) issue, but I recently sent them the ESX logs and am waiting for feedback from that perspective.

R

binoche · ‎02-11-2010

Please check with them whether the QFullThreshold KB also applies to HP EVA arrays. Thanks.

rak · ‎06-29-2010

We're having the same issue with two RHEL 5 64-bit VMs and the pvscsi driver. We are using ESXi, 4.0.0, 244038 on the host side with EMC storage.

Any help would be greatly appreciated.

shaka · ‎04-12-2011

Did you ever get an answer from VMware on this issue? We are having similar issues and are being told that PVSCSI is not supported on RHEL 5 guests.