VMware Cloud Community
jmikolajek
Contributor

ESX host and storage on a SAN

We currently have 8 ESX servers hosting approximately 40 VM guests in a VM cluster. An IBM DS4400 SAN hosts the storage for this VM environment, with roughly 10 LUNs configured as storage. Here is the issue we are seeing, and we would like to know if anyone has experienced it. We have added a couple of new ESX hosts and migrated an existing host, which involved rebuilding it. Each time we reboot an ESX host (so far only the newly built or rebuilt hosts) or perform a port scan of the HBAs, our whole virtual environment experiences what I would describe as a "hiccup" in its access to the SAN. A group of servers (we are unable to identify a pattern as to which group) will show "no OS found" when we access the console. Resetting the guest resolves the issue. We also find a number of guests with I/O errors in the event logs; rebooting them clears that error. We do not see any issues with non-VM servers that access the SAN, nor do we see any indication of a problem on the SAN itself: no errors are reported in any logs.

Any help I can get would be greatly appreciated. I'm not sure, but I suspect the SAN configuration for these hosts may be off, or something else needs to be configured on the ESX servers. I'm unsure why it would affect the other ESX hosts, though. Everything "looks" correct.

Jim

5 Replies
mvoss18
Hot Shot

Welcome to the forums.

I've seen this before, and it's interesting that you started to see this issue when you went beyond 8 hosts. I've read many times that when you go above 8 hosts in a cluster you really start to see a lot of SCSI reservation errors. The particular SAN has to be able to handle that many hosts accessing that many LUNs simultaneously. I suspect that your IBM DS4400 (which is discontinued by IBM) probably can't handle more than 8 hosts in a cluster.

http://communities.vmware.com/thread/185201

I'm betting if you look in your logs you'll see a lot of SCSI reservation errors. Send those logs to VMware support.
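Before sending the logs off, it can help to get a rough count of how often each host reports a conflict. The following is only a minimal sketch: it assumes you have already copied the log lines from each host into memory, and it assumes the phrase "reservation conflict" appears in the relevant lines. The host names and sample lines are made up for illustration and are not real vmkernel output.

```python
import re
from collections import Counter

# Assumed pattern: the phrase "reservation conflict" anywhere in a line.
CONFLICT = re.compile(r"reservation conflict", re.IGNORECASE)


def count_conflicts(logs_by_host):
    """Return a Counter mapping host name -> number of matching log lines."""
    counts = Counter()
    for host, lines in logs_by_host.items():
        counts[host] = sum(1 for line in lines if CONFLICT.search(line))
    return counts


# Hypothetical sample lines, not copied from a real host.
sample = {
    "esx01": ["SCSI: reservation conflict on lun 3", "heartbeat ok"],
    "esx09": [
        "SCSI: Reservation Conflict on lun 3",
        "SCSI: reservation conflict on lun 5",
        "rescan done",
    ],
}

counts = count_conflicts(sample)
for host, n in counts.most_common():
    print(host, n)
```

If the newly built hosts stand out in a count like this, that is worth mentioning when you open the support case.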

You might also:

-check that your multipathing setup is correct for that SAN (MRU vs Fixed)

-make sure the firmware on the hosts, HBAs and SAN equipment (including switches) is up to date

-set the BIOS in your HBAs to "point to point" only

If you still cannot go beyond 8 hosts, you might consider creating a new cluster with your two new servers.

RParker
Immortal

I've read many times that when you go above 8 hosts in a cluster you really start to see a lot of SCSI reservation errors

That's because few people follow the documentation closely. There is a setting for making the queue depth LOWER the more hosts you have, which will manage the SCSI reservations better, but even with 1 HOST on 1 LUN you will STILL get SCSI reservations. Every open, close, or other file operation makes a 'reservation' request. It's more noticeable with more hosts because it's more time sensitive, but 8 isn't the magic number; it also depends on the number of paths from those hosts.

So for each VMDK, on EACH LUN, on EACH path, on EACH host, there is a file request. The more VMs you have, and the more hosts you add, the more likely you are to run into SCSI reservation conflicts.
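To put rough numbers on that, here is a back-of-the-envelope sketch. It simply multiplies the four factors together as a crude upper bound; it is not how ESX actually counts reservations. The host, LUN and guest counts come from the original post, while 2 paths per host and one VMDK per guest (so 4 per LUN) are assumptions.

```python
def request_sources(hosts, paths_per_host, luns, vmdks_per_lun):
    """Crude count of host/path/LUN/VMDK combinations that can issue requests."""
    return hosts * paths_per_host * luns * vmdks_per_lun


# 8 hosts, ~10 LUNs, ~40 guests (from the thread); 2 paths per host is assumed.
before = request_sources(hosts=8, paths_per_host=2, luns=10, vmdks_per_lun=4)
# Same environment after adding two hosts.
after = request_sources(hosts=10, paths_per_host=2, luns=10, vmdks_per_lun=4)

print(before, after, after / before)
```

Under those assumptions, two extra hosts raise the count from 640 to 800, and doubling the paths would double it again, which is why the host count alone is not the whole story.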

RParker
Immortal

We will also find a number of guests with IO errors in the event logs - rebooting them clears that error. We do not see any issues with non-VM servers that access the SAN nor do we see any indication of any problem on the SAN - no errors reported in any logs etc....

Consider changing the HBA settings from the defaults and making the queue depth half of what it is now. That may help some. ISO files attached to VMs and the number of VMDKs also contribute to these reservations, and since each guest writes logs, that may also add traffic on your SAN, so if you don't need the logs, turn those off as well.
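As a rough illustration of what halving the queue depth does, the sketch below computes a worst-case number of commands the array could be holding if every host filled its queue on every LUN at once. It assumes the queue depth applies per LUN per host, and the starting value of 32 is an assumed figure, so check what your HBAs are actually set to.

```python
def worst_case_outstanding(hosts, luns, queue_depth):
    """Upper bound on queued commands if every host fills its queue on every LUN."""
    return hosts * luns * queue_depth


current_depth = 32                # assumed value, not read from any real HBA
halved_depth = current_depth // 2

# 10 hosts and 10 LUNs roughly match the environment described in this thread.
print(worst_case_outstanding(hosts=10, luns=10, queue_depth=current_depth))
print(worst_case_outstanding(hosts=10, luns=10, queue_depth=halved_depth))
```

The real load will be far below this bound most of the time; the point is only that the ceiling scales with the host count, so a depth that was fine with fewer hosts may be too generous once more are added.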

mvoss18
Hot Shot

I've seen this on a high-end SAN (HP XP12000) as well, once the number of hosts and VMs started to grow significantly. VMs were hanging, needed a reset, and showed all kinds of disk errors, and SCSI reservation errors were coming out like crazy. To fix the #1 problem, which was the VMs hanging, the solution was actually two things:

-Follow the vendor's recommendation to stop using LUSE LUNs

-Update firmware and set the Host Mode option (this was the largest contributor to the problem).

Once the Host Mode option was set we stopped seeing these issues, but SCSI reservation errors never completely went away.

JohnADCO
Expert

The only way I know of to reduce reservation issues (and this depends on the SAN model) is to reduce the number of VMs per LUN. For the heaviest VMs, we give the hard-hit datastore a dedicated LUN.

We really are not seeing any reservation issues, and we use cheapo iSCSI SANs.
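The "fewer VMs per LUN, dedicated LUN for the heavy hitters" approach can be sketched as a simple planning helper. This is only an illustration: the VM names, the load figures, the threshold and the per-LUN limit are all hypothetical, and the right values depend on the SAN.

```python
def place_vms(vm_load, heavy_threshold, max_vms_per_lun):
    """Return a list of LUNs, each a list of VM names.

    VMs at or above heavy_threshold each get a dedicated LUN; the rest are
    packed in name order, max_vms_per_lun at a time.
    """
    heavy = sorted(name for name, load in vm_load.items() if load >= heavy_threshold)
    light = sorted(name for name, load in vm_load.items() if load < heavy_threshold)
    luns = [[name] for name in heavy]
    for i in range(0, len(light), max_vms_per_lun):
        luns.append(light[i:i + max_vms_per_lun])
    return luns


# Hypothetical per-VM load figures, e.g. average IOPS.
vms = {"db01": 900, "mail01": 700, "web01": 120, "web02": 110, "dc01": 40}
layout = place_vms(vms, heavy_threshold=500, max_vms_per_lun=2)
for n, members in enumerate(layout):
    print("LUN", n, members)
```

With these made-up numbers, db01 and mail01 each get their own LUN and the three lighter guests are split across two more.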
