MKguy
Virtuoso

SCSI-timeout bug in ESX4 VMware Tools for kernels below 2.6.13 (e.g. RHEL4)

I recently filed a service request with VMware support about the following issue:

The SCSI timeout value in /sys/block/sdX/device/timeout is not increased by installing the latest ESX4 VMware Tools on RHEL4 VMs. It stays at the default of 30 seconds, whereas on RHEL5/CentOS5 VMs installing VMware Tools raises the value from the default of 60 to 180 seconds.
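A minimal sketch for checking what a guest currently uses, assuming the disks show up as sd* under /sys/block as described above. The function name and the SYSFS_ROOT override are hypothetical and only there to make the snippet easy to try against a test directory:

```shell
#!/bin/sh
# Print the current SCSI command timeout (in seconds) for every sd* disk.
# SYSFS_ROOT is a hypothetical override for testing; on a real guest it
# stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

show_scsi_timeouts() {
    for f in "$SYSFS_ROOT"/block/sd*/device/timeout; do
        # Skip the unexpanded glob when no sd* devices exist.
        [ -r "$f" ] || continue
        # Reduce ".../block/sda/device/timeout" to "sda".
        disk=${f#"$SYSFS_ROOT"/block/}
        disk=${disk%%/*}
        printf '%s %s\n' "$disk" "$(cat "$f")"
    done
}

show_scsi_timeouts
```

Each output line is the disk name followed by its timeout, so an affected RHEL4 VM would be expected to report 30 for its disks.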

This may not sound like a big deal, but it cost us some unnecessary downtime on our RHEL4 (and only RHEL4) VMs when we had problems with the SAN, while all RHEL5 and Windows VMs kept running just fine. Of course we can easily set this value manually, but isn't that the job of VMware Tools (at least that's my expectation), just as they already do on Windows and RHEL5 VMs?

As it turned out during various tests and the email exchange with VMware support, this is due to a missing udev rule (99-vmware-scsi-udev.rules) in /etc/udev/rules.d/ on these RHEL4 VMs after installing the tools.

99-vmware-scsi-udev.rules

#
# VMware SCSI devices Timeout adjustment
#
# Modify the timeout value for VMware SCSI devices so that
# in the event of a failover, we don't time out.
# See Bug 271286 for more information.
#
# Note: The Udev systems vary from distro to distro.  Hence all of the
#       extra entries.

# Redhat systems
ACTION=="add", BUS=="scsi", SYSFS{vendor}=="VMware, " , SYSFS{model}=="VMware Virtual S",   RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

According to support, the reason this rule is missing on all RHEL4 VMs is that it requires a Linux kernel of at least 2.6.13 (which apparently contained udev-related updates). However, all RHEL4 releases, as well as the Red Hat kernel updates for RHEL4, are based on the 2.6.9 kernel.

Support stated that they have no intention of fixing this (for example by setting the SCSI timeout automatically in some other way) or of releasing a KB article to point out this potential problem to the public.

Yes, as stated above, this might not be that big of a deal, and I know I can easily set the timeout values myself. What bothered me a bit was support's reluctance to actually fix this someday, or at least to plan a KB article with an official recommendation to set this value manually on certain systems.
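For reference, setting the value by hand can be sketched roughly like this, for instance from a boot script. The function name and the SYSFS_ROOT override are hypothetical, and the loop deliberately touches every sd* disk rather than only VMware virtual disks:

```shell
#!/bin/sh
# Write a new SCSI command timeout (in seconds) to every sd* disk.
# SYSFS_ROOT is a hypothetical override for testing; on a real guest it
# stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

set_scsi_timeouts() {
    secs=$1
    for f in "$SYSFS_ROOT"/block/sd*/device/timeout; do
        # Skip the unexpanded glob and anything we may not write to.
        [ -w "$f" ] || continue
        echo "$secs" > "$f"
    done
}

# Example call (needs root on a real system):
#   set_scsi_timeouts 180
```

A script like this only covers disks that exist when it runs, so disks added later would keep the default until it is run again.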

I mean, I can't be the first one to stumble upon this issue, can I? Has anyone else seen this and accordingly contacted VMware support?

-- http://alpacapowered.wordpress.com
MKguy
Virtuoso

In case anyone cares:

Since we were not too fond of using rc.local to set the value at every boot, we contacted Red Hat support, who suggested trying the following udev rule:

>KERNEL="sd*", SYSFS="VMware", SYSFS="Virtual Disk", NAME="%k", PROGRAM="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"

This works without issues for us on a couple of RHEL4 boxes.

It's still a bit disappointing that a suggestion like that (or any suggestion at all) didn't come from VMware themselves.

darniellec
Contributor

Did you see any issues once your timeouts were in place with udev not building the actual block device file for your RHEL 4 hosts?

I posted the following in December 2009. It's roughly the same concept, but I am using NetApp's guest OS (GOS) scripts, which write udev rules that tweak the timeout value. We are still on ESX 3.5 and currently use SAN disks.

== Quote - NetApp GOS timeout scripts - Dec 1, 2009 ==

I'm curious whether anyone who has run the applicable Linux and Solaris GOS timeout scripts from NetApp has seen any side effects. It's my understanding that VMware Tools in vSphere also implements a similar setting.

NetApp Solution ID: kb41511 - VMware ESX Guest OS I/O Timeout Settings for NetApp Storage Systems

I'm noticing that the rhel4_gos_timeout-install.sh script seems to break the guest OS's ability to dynamically create a device file in /dev for a hot-added virtual disk. This is not the case for the rhel5_gos_timeout-install.sh script.

I also noticed that I get the following entry in /var/log/messages when rescanning for a newly added virtual disk with the NetApp timeout setting in place:

Dec 1 16:40:55 bobo hald30837: Timed out waiting for hotplug event 594. Rebasing to 594

If I remove the files this script creates and restart udev, things go back to working again.

Chad

Texiwill
Leadership

Hello,

Good information. As a moderator, I have submitted this to become a KB article. I'm not sure when that will happen, but this is very important information. I think VMware is not going to 'fix' this because RHEL4 is an older OS, and since it works for RHEL5 they feel it's already fixed. I'm not sure I agree with them there, as there are still plenty of RHEL4 VMs out there.


Best regards,
Edward L. Haletky VMware Communities User Moderator, VMware vExpert 2009, 2010

--
Edward L. Haletky
vExpert XIII: 2009-2021,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill
MKguy
Virtuoso

Thanks for picking this up and getting it to VMware, Texiwill.

darniellec:

>Did you see any issues once your timeouts were in place with udev not building the actual block device file for your RHEL 4 hosts?

Nope, I haven't seen any issues with our own (or rather RedHat provided) udev rules on our RHEL4 VMs.

admin
Immortal

Hi,

We have a related KB article that recommends this:

http://kb.vmware.com/kb/1009465

Deploying VMware Tools should increase these values. If it does not, you will have to increase them manually.

Thanks,

Anand

dtasj
Contributor

I was having issues with RHEL6.

I was able to figure it out with lsscsi. This is the udev rule that VMware Tools installs:

ACTION=="add", BUS=="scsi", SYSFS{vendor}=="VMware, " , SYSFS{model}=="VMware Virtual S", RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

After running lsscsi I saw this:


[root@ ~]# lsscsi -c
Attached devices:
Host: scsi1 Channel: 00 Target: 00 Lun: 00
  Vendor: NECVMWar Model: VMware IDE CDR10 Rev: 1.00
  Type:   CD-ROM                           ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Target: 00 Lun: 00
  Vendor: VMware   Model: Virtual disk     Rev: 1.0
  Type:   Direct-Access                    ANSI SCSI revision: 02

And discovered that these 2 identifiers were the issue:

BUS=="scsi"

and

SYSFS{model}=="VMware Virtual S"

so the rule that I needed was this one:

ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware  " , ATTRS{model}=="Virtual disk    ",   RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

which is the rule intended for Debian systems!
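Since the rule above matches on vendor and model strings that include trailing spaces, it can help to look at the exact values the guest reports. The following is a rough sketch that reads them from sysfs and prints them in brackets so the padding becomes visible. The function name and the SYSFS_ROOT override are hypothetical, and it assumes the disks expose vendor and model files under /sys/block/sdX/device:

```shell
#!/bin/sh
# Print the vendor and model strings of every sd* disk in brackets so that
# trailing padding spaces are visible.
# SYSFS_ROOT is a hypothetical override for testing; on a real guest it
# stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

show_scsi_ids() {
    for d in "$SYSFS_ROOT"/block/sd*/device; do
        [ -r "$d/vendor" ] && [ -r "$d/model" ] || continue
        # IFS= and -r keep leading and trailing whitespace intact.
        IFS= read -r vendor < "$d/vendor"
        IFS= read -r model < "$d/model"
        printf '%s vendor=[%s] model=[%s]\n' \
            "$(basename "${d%/device}")" "$vendor" "$model"
    done
}

show_scsi_ids
```

The bracketed values can then be compared character by character with the strings used in the rule.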

In my particular case on RHEL6 it wasn't working because I had commented out the Debian and SUSE lines to avoid the errors that I was seeing at boot time. This KB article outlines it well:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102389...

The KB article is incorrect, because it says:

--snip--

For Redhat systems, change the line:

ACTION=="add", BUS=="scsi", SYSFS{vendor}=="VMware, " , SYSFS{model}=="VMware Virtual S", RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

To:

ACTION=="add", BUS=="scsi", SYSFS{vendor}=="VMware " , SYSFS{model}=="Virtual disk ", RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

--snip--

but in reality

BUS=="scsi"

needs to be replaced with:

SUBSYSTEMS=="scsi" in order for the rule to work, and the value in SYSFS{model}=="Virtual disk " needs the additional padding spaces after "disk".

And contrary to the KB article, you don't have to reboot. Just run /sbin/start_udev.
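To confirm that the rule took effect after running /sbin/start_udev, a quick check along these lines should do. This is only a sketch: the function name and the SYSFS_ROOT override are hypothetical, and it treats every sd* disk as one that should carry the new value:

```shell
#!/bin/sh
# Report every sd* disk whose SCSI timeout differs from the expected value.
# Returns 0 if all disks match and 1 otherwise.
# SYSFS_ROOT is a hypothetical override for testing; on a real guest it
# stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

verify_scsi_timeouts() {
    expected=$1
    rc=0
    for f in "$SYSFS_ROOT"/block/sd*/device/timeout; do
        [ -r "$f" ] || continue
        actual=$(cat "$f")
        if [ "$actual" != "$expected" ]; then
            echo "MISMATCH $f: $actual (expected $expected)"
            rc=1
        fi
    done
    return $rc
}

# Example call after reloading udev:
#   verify_scsi_timeouts 180 && echo "all disks at 180 seconds"
```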
