VMware Cloud Community
Bosco
Contributor

Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

We had an incident where a disk failure in our central storage, an EVA5000, caused the filesystems on some of our RedHat AS4-based VMs to go read-only. Below is an example of a log entry we saw on a number of ESX servers (running VMware Infrastructure 3.0.2); the condition lasted for about 20 seconds:

May 12 06:02:19 xxx vmkernel: 8:23:05:11.951 cpu0:1028)LinSCSI: 2610: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

We opened a case with VMware, and they said this happens because ESX lost communication with the storage, which is normal behaviour. We then opened a case with HP on the EVA side, and they said the disk outage should NOT affect the servers, so there was nothing wrong on their end other than the disk needing replacement.

As we do not see this happen on the Windows or Solaris guests, I am wondering why this impacts the Linux VMs, and why they do not recover once ESX restores the storage connection. Our only way to recover is to reboot the VMs. Has anyone experienced this as well? Is there anything we can do to prevent it from happening again?

1 Reply
BenConrad
Expert

We've seen our Linux boxes go RO as well.

Check out the following link; a newer kernel fixes this issue in some distros:

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=51306&slic...

We've implemented the following (attached) script for Debian and Gentoo; it sets the disk timeout to 120 seconds.
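The attached script isn't reproduced inline here, but a minimal sketch of the same idea, assuming the standard /sys/block/<dev>/device/timeout interface on a 2.6 kernel (the device glob and the 120-second value are just the ones mentioned above), would look something like this:

#!/bin/sh
# Sketch only, not the attached script: raise the SCSI command timeout on
# all sd* disks so a short storage outage does not cause the guest
# filesystem to be remounted read-only.

TIMEOUT=120

for f in /sys/block/sd*/device/timeout; do
    [ -f "$f" ] || continue        # skip if no matching devices
    echo $TIMEOUT > "$f"
    echo "set $f to $TIMEOUT seconds"
done

You would typically run something like this from an init/rc script so the setting is reapplied after every reboot, since the value does not persist on its own.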
