Solved: Re: Temporary loss of SAN connectivity.

vmproteau · ‎06-15-2008

ESX 3.0.2, Virtual Center 2.5 Update 1

While I was on vacation, my company had a SAN problem where they had to reboot the Master Controller on our HP EVA 6100. I was told there was no problems with our ESX server environment connectivity and the built in SAN redundancy successfully kept the Hosts connected.

Now that I've had a chance to look this over more closely, I see that every VM on Hosts attached to this SAN show system log errors relating to disks when the Master Controller was power cycled. I've attched the system log events from the WIndows 2003 VMs. Each VM has the exact same entries at the exact same times. I've included the text of these events below.

My assumption is the OS handled the loss of connectivity as I haven't heard anything about the event or issues since then but, I am concerned with undelying unseen corruption. Can anyone offer insight into Windows VM behavoir when the Host they are on lose connectivity to the SAN? I understand that it would depend on what was going on each individual VM at the time of the problem but, would you expect a Windows 2003 OS to recover from a momentaty disk connectivity problem like this?

Event Type: Error

Event Source: symmpi

Event Category: None

Event ID: 15

Date: 6/6/2008

Time: 12:45:20 AM

User: N/A

Computer: S8270W202

Description:

The device, \Device\Scsi\symmpi1, is not ready for access yet.

Event Type: Error

Event Source: Disk

Event Category: None

Event ID: 11

Date: 6/6/2008

Time: 12:45:20 AM

User: N/A

Computer: S8270W202

Description:

The driver detected a controller error on \Device\Harddisk0.

jdvcp · ‎06-15-2008

By default on Windows 2003 (in a VM), the disk timeout is 60 seconds. You need to see what your hba timeout is set to at the hardware/firmware level. I forced our qlogic hbas to 15 seconds, so that failover at the path (or ESX) level would occur ideally after 15 seconds but before the 60 second timeout on the windows VM. If you hba is set to 30 seconds, that leaves only 30 more seconds for ESX to renegotiate paths to the VMFS in questions prior to the OS's at the VM guest level having issue. Did these VMs crash? Did they have any data corruption? If not, these errors may just be Windows saying that it is eating into it's disk timeout threshhold.

As far as corruption goes, I don't think you have to worry. Realistically, during the path failover, there was no place for any potentially bad data to get written to. It was cached at the Windows VM guest level, then dumped out to disk when it became available.

I have had total VMFS loss before. Basically, after 60 seconds, the Windows VMs crash. They came back up ok once the issue was resolved.

The only time I have seen disk corruption at the VM level was when my storage subsystem had physical corruptions on a few disk, causing RAID stripes to be inconsistent. I think you are ok...the environment seems to have reacted correctly.

View solution in original post

jdvcp · ‎06-15-2008

By default on Windows 2003 (in a VM), the disk timeout is 60 seconds. You need to see what your hba timeout is set to at the hardware/firmware level. I forced our qlogic hbas to 15 seconds, so that failover at the path (or ESX) level would occur ideally after 15 seconds but before the 60 second timeout on the windows VM. If you hba is set to 30 seconds, that leaves only 30 more seconds for ESX to renegotiate paths to the VMFS in questions prior to the OS's at the VM guest level having issue. Did these VMs crash? Did they have any data corruption? If not, these errors may just be Windows saying that it is eating into it's disk timeout threshhold.

As far as corruption goes, I don't think you have to worry. Realistically, during the path failover, there was no place for any potentially bad data to get written to. It was cached at the Windows VM guest level, then dumped out to disk when it became available.

I have had total VMFS loss before. Basically, after 60 seconds, the Windows VMs crash. They came back up ok once the issue was resolved.

The only time I have seen disk corruption at the VM level was when my storage subsystem had physical corruptions on a few disk, causing RAID stripes to be inconsistent. I think you are ok...the environment seems to have reacted correctly.

bolsen · ‎06-16-2008

My highly technical guess is you'll be fine. 😛

t35216 · ‎07-08-2008

Hi I am experiencing the same Events as the original poster. My dilemna is I am evaluating the esx infrastructure as a DR environment for a customer, the budget did not allow for failover hba's can you advise on best course of action. My data drive restores are failing when these events are generated.

All

Temporary loss of SAN connectivity.