IB_IT
Expert

failing IO due to too many SCSI reservations

All, I have three ESX hosts (3.5 U4) in a cluster. For over a year I have been fighting these "failing I/O due to too many SCSI reservations" errors. Multiple support tickets and multiple attempted fixes later, I am still seeing them come in. I adjusted the queue depths, upgraded all firmware and BIOS on both the host systems and the HBAs, double-checked the SAN path policies, and so on...no change in the results. We get these SCSI reservation errors multiple times a day, at random, on the hosts.

Today I noticed something interesting. We had several P2V servers (Windows 2000 SP4) that were migrated to the virtual environment about two years ago. Physically, they were all multi-proc servers, but when P2V'd, they became single-CPU VMs. It appears that whoever migrated these did not change the HAL from multiproc to uniproc in the OS. I know it is a particular pain to change from multiproc to uniproc in Windows 2000...an in-place upgrade is probably what is needed, but that's another story.

When checking the esxtop statistics, I noticed that on the VMs that need the HAL adjusted, the %CPU ready times are enormous. Some stay constantly over 10 and even exceed 60. This is what made me check the HALs in the first place. So my question is...would these high CPU ready times be causing (or contributing to) some of my SCSI reservation errors?

4 Replies
weinstein5
Immortal

I would upgrade to U5. As I understand it, VMware has made improvements to the SCSI subsystem and to how reservations are made and released.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

mark_chuman
Hot Shot

Do you have round-robin enabled? Can you check cache utilization on the SAN? Does this occur at any certain time? Do your SAN settings match the VMware-recommended settings for your SAN infrastructure? We battle SAN problems more frequently than we would like. Disabling round-robin (long story on why we were using it) did the trick for us. Also, we were hit by other (non-ESX) consumers on the SAN running huge disk-to-disk backups, huge processing jobs, etc.

How long do these SCSI "storms" last? As you have probably heard from VMware, the existence of reservation errors is not a problem in itself; the amount is. What's the impact of these events (hung VMs, etc.)?
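To put numbers on "does this occur at any certain time" and "how many", something like the rough Python sketch below could tally the messages per hour from a copy of the vmkernel log. It assumes the log lines start with a syslog-style "Mon DD HH:MM:SS" stamp and that the messages contain the text "reservation conflict"; both are assumptions to check against your own logs, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumption: each log line starts with a syslog-style stamp, "Mon DD HH:MM:SS ".
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s")

def count_conflicts_by_hour(lines, needle="reservation conflict"):
    """Count lines containing `needle` (case-insensitive), keyed by (month, day, hour)."""
    counts = Counter()
    for line in lines:
        if needle not in line.lower():
            continue
        m = STAMP.match(line)
        if m:  # lines without a recognizable stamp are skipped
            month, day, hour = m.groups()
            counts[(month, int(day), int(hour))] += 1
    return counts

def storm_hours(counts, threshold):
    """Return the hours with at least `threshold` matching lines, busiest first."""
    return [key for key, n in counts.most_common() if n >= threshold]

# Usage (the file name is hypothetical, a copy of the host's vmkernel log):
# with open("vmkernel.log") as f:
#     print(storm_hours(count_conflicts_by_hour(f), threshold=10))
```

Passing needle="too many reservation conflicts" would count only the "failing I/O" messages rather than every conflict. If the busy hours line up with backup windows or with other consumers on the SAN, that points at the array rather than the hosts.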

mark_chuman
Hot Shot

In my opinion, high CPU wait times would not contribute to the SCSI reservation errors in the logs. SCSI reservations should only be needed during certain "events", such as powering on a VM or VMotioning a VM. This KB lists the events that cause locks, which lead to SCSI reservations - http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100500....

I would drill down into the I/O metrics in esxtop for those VMs if you have concerns about them.

IB_IT
Expert

Sorry for the late response here. Yes, the amount is what is causing concern. For any given host in the cluster, I will see SCSI reservation conflicts several times a week. I have tried a few different queue depths, and none of them appears to help. I wonder if this is caused by the SANs that are attached? I have one SAN that needs a "Fixed" path policy and another that needs "MRU". Would this cause all the fuss?

Going forward we have a plan to move to one SAN, so I guess I could wait and see when we move off the other two SANs if this is still an issue.

Message was edited by: IB_IT. Sorry, to clarify: what I see several times a week is the "failing IO due to too many reservation conflicts" message, not just the SCSI reservation conflicts as I stated above.
