aaronb123
Contributor

iSCSI and Host disconnect issues

Hi,

Hope the VMware community can offer some help with this...

Environment:

I have an ESX 3.5 cluster that consists of 6 hosts. All hosts are at the same patch level: 3.5 U4 plus critical and security patches. The cluster is HA and DRS enabled and has approximately 60 VMs spread evenly across it. Each host has the exact same hardware: HP DL380 G5s with 32 GB of RAM, SAN attached via a QLogic QLA4052c on an isolated LAN. The SAN consists of two NetApp 3040s. There are approximately 10 x 500 GB LUNs presented to each of the 6 ESX servers; half are FC disks and half are SATA disks.

Backups are performed mostly via NetBackup 6.5.2 agents on the VMs (yes, I am working on eliminating these) and a few via VCB only.

Problem:

About every 2-4 weeks, one of the ESX servers (seemingly at random) will become disconnected from the SAN. I can still ping the Service Console and PuTTY into the ESX host, but the host appears disconnected in vCenter, as do all the VMs on it. While connected to the host I can execute commands but cannot browse "/vmfs/volumes". The vmkwarning log contains the following: "WARNING: FS3: 4787: Reservation error: Timeout" and multiple other similar errors (I'll post the log below). I can reboot the host with a "reboot -f" and it boots up normally. So, obviously, the host is not able to access a LUN and times out. But only one random host among the 6 hosts in that cluster.

I've contacted VMware support multiple times and got a different version of "we don't know" each time, so now I'm lost... I hope that the VMware community can help, because so far our Platinum Level VMware Support has failed me.

What I have done thus far:

Reinstalled every host from scratch using VMware best practices and NetApp best practices. I also completely uninstalled all HP CIM agents from the servers. It's important to note the following: I have one other cluster (2 hosts) at the same data center, using the same SAN, and its hosts are the same hardware as the cluster mentioned above. Also, I have a 4-host cluster at a different data center that uses the exact same server hardware and NetApp SAN. Neither of the other clusters experiences these issues.

What I'm thinking:

I originally thought it was an issue with my ESX scripted install, but since all the hosts have been manually reinstalled, I'm starting to think that it is backup related. The backups were running when this most recent incident occurred. Unfortunately, due to the extended time frame between incidents, I cannot check whether the backups were running the last time this happened.

Copy of the vmkwarning log:

May 30 18:38:38 ESX_Server_Name vmkernel: 4:08:45:14.918 cpu0:1041)WARNING: Fil3: 1787: Failed to reserve volume f530 28 1 49aee17a 6052f1f5 21004ed9 ea7bd55a 0 0 0 0 0 0 0

May 30 18:38:41 ESX_Server_Name vmkernel: 4:08:45:17.485 cpu5:1043)WARNING: FS3: 4787: Reservation error: Timeout

May 30 18:38:41 ESX_Server_Name vmkernel: 4:08:45:17.485 cpu5:1043)WARNING: FS3: 4981: Reclaiming timed out heartbeat failed: Timeout

May 30 18:39:21 ESX_Server_Name vmkernel: 4:08:45:57.487 cpu5:1043)WARNING: FS3: 4787: Reservation error: Timeout

May 30 18:40:01 ESX_Server_Name vmkernel: 4:08:46:37.487 cpu5:1043)WARNING: FS3: 4787: Reservation error: Timeout

May 30 18:40:38 ESX_Server_Name vmkernel: 4:08:47:14.970 cpu0:1040)WARNING: ScsiDevice: 3264: Failed for vml.02000e000060a98000486e544f584a4e66734c4a6e4c554e202020: Timeout

May 30 18:40:38 ESX_Server_Name vmkernel: 4:08:47:14.970 cpu0:1040)WARNING: ScsiDevice: 3633: READ CAPACITY on device "vml.02000e000060a98000486e544f584a4e66734c4a6e4c554e202020" from Plugin "legacyMP" failed. Timeout

May 30 18:40:41 ESX_Server_Name vmkernel: 4:08:47:17.488 cpu5:1043)WARNING: FS3: 4787: Reservation error: Timeout

May 30 18:41:18 ESX_Server_Name vmkernel: 4:08:47:54.972 cpu0:1040)WARNING: ScsiDevice: 3264: Failed for vml.02000e000060a98000486e544f584a4e66734c4a6e4c554e202020: Timeout

May 30 18:41:18 ESX_Server_Name vmkernel: 4:08:47:54.972 cpu0:1040)WARNING: ScsiDevice: 3633: READ CAPACITY on device "vml.02000e000060a98000486e544f584a4e66734c4a6e4c554e202020" from Plugin "legacyMP" failed. Timeout

May 30 18:41:18 ESX_Server_Name vmkernel: 4:08:47:54.972 cpu0:1040)WARNING: Fil3: 1787: Failed to reserve volume f530 28 1 49aee17a 6052f1f5 21004ed9 ea7bd55a 0 0 0 0 0 0 0

May 30 18:41:21 ESX_Server_Name vmkernel: 4:08:47:57.488 cpu5:1043)WARNING: FS3: 4787: Reservation error: Timeout

May 30 18:41:44 ESX_Server_Name vmkernel: 4:08:48:20.819 cpu7:1156)WARNING: VSCSIFs: 426: scatter-gather says length 0, op says 4096

May 30 18:41:46 ESX_Server_Name vmkernel: 4:08:48:22.559 cpu4:1156)WARNING: VSCSIFs: 426: scatter-gather says length 0, op says 4096

May 30 18:41:48 ESX_Server_Name vmkernel: 4:08:48:24.408 cpu4:1156)WARNING: VSCSIFs: 426: scatter-gather says length 0, op says 4096

May 30 18:41:58 ESX_Server_Name vmkernel: 4:08:48:34.974 cpu0:1040)WARNING: ScsiDevice: 3264: Failed for vml.02000e000060a98000486e544f584a4e66734c4a6e4c554e202020: Timeout

May 30 18:41:58 ESX_Server_Name vmkernel: 4:08:48:34.974 cpu0:1040)WARNING: ScsiDevice: 3633: READ CAPACITY on device "vml.02000e000060a98000486e544f584a4e66734c4a6e4c554e202020" from Plugin "legacyMP" failed. Timeout

May 30 18:42:01 ESX_Server_Name vmkernel: 4:08:48:37.488 cpu5:1043)WARNING: FS3: 4787: Reservation error: Timeout

May 30 18:42:38 ESX_Server_Name vmkernel: 4:08:49:14.976 cpu0:1040)WARNING: ScsiDevice: 3264: Failed for vml.02000e000060a98000486e544f584a4e66734c4a6e4c554e202020: Timeout

May 30 18:42:38 ESX_Server_Name vmkernel: 4:08:49:14.976 cpu0:1040)WARNING: ScsiDevice: 3633: READ CAPACITY on device "vml.02000e000060a98000486e544f584a4e66734c4a6e4c554e202020" from Plugin "legacyMP" failed. Timeout

May 30 18:42:38 ESX_Server_Name vmkernel: 4:08:49:14.976 cpu0:1040)WARNING: Fil3: 1787: Failed to reserve volume f530 28 1 49aee17a 6052f1f5 21004ed9 ea7bd55a 0 0 0 0 0 0 0

May 30 18:42:41 ESX_Server_Name vmkernel: 4:08:49:17.490 cpu5:1043)WARNING: FS3: 4787: Reservation error: Timeout

May 30 18:43:18 ESX_Server_Name vmkernel: 4:08:49:55.006 cpu0:1040)WARNING: ScsiDevice: 3264: Failed for vml.02000e000060a98000486e544f584a4e66734c4a6e4c554e202020: Timeout

May 30 18:43:18 ESX_Server_Name vmkernel: 4:08:49:55.006 cpu0:1040)WARNING: ScsiDevice: 3633: READ CAPACITY on device "vml.02000e000060a98000486e544f584a4e66734c4a6e4c554e202020" from Plugin "legacyMP" failed. Timeout

May 30 18:43:21 ESX_Server_Name vmkernel: 4:08:49:57.491 cpu5:1043)WARNING: FS3: 4787: Reservation error: Timeout
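As a quick way to see the pattern in a log like this, here is a rough sketch (plain Python, run against a saved copy of the log; the function name is just illustrative, not any VMware tool) that tallies vmkernel warnings by subsystem and message ID:

```python
import re
from collections import Counter

def tally_vmkwarnings(lines):
    """Count vmkernel WARNING messages by (subsystem, message id, text).

    Expects lines like:
    'May 30 18:38:41 host vmkernel: ...WARNING: FS3: 4787: Reservation error: Timeout'
    """
    pattern = re.compile(r"WARNING: (\w+): (\d+): (.*)")
    counts = Counter()
    for line in lines:
        m = pattern.search(line)
        if m:
            subsystem, msg_id, text = m.groups()
            counts[(subsystem, msg_id, text.strip())] += 1
    return counts
```

Seeing "FS3: 4787" and "Fil3: 1787" dominate points at SCSI reservation contention rather than, say, a single dead path.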

Any help that you could offer would be most appreciated.

Aaron B

AndreTheGiant
Immortal

There are some similar discussions about NetApp reservation issues.

As you say, the problem could be related to backup operations. How many concurrent jobs are you running in NetBackup?

Netbackup 6.5.2 Agents on the VMs

Did you know that NetBackup also has a native integration solution for VCB?

http://www.symantec.com/connect/forums/netbackup-vcb-setup-install-guide

Andre

**if you found this or any other answer useful please consider allocating points for helpful or correct answers

Andre | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
aaronb123
Contributor

Thanks for the info Andre

Anyone else have similar experiences?

aaronb123
Contributor

In interest of sharing.

After about 2 months of searching, I believe I found the reason for the ESX server iSCSI storage disconnections. My iSCSI switches were set to jumbo frames (9000 MTU)... my ESX server iSCSI HBAs were set to standard frames (1500 MTU).

I believe this caused the disconnects in my environment, especially when the backups were running. So far so good, but I think I'll need to make a sacrifice to the iSCSI gods.
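To put numbers on that mismatch: an IPv4 ping carries 20 bytes of IP header and 8 bytes of ICMP header, so a 9000-byte MTU carries at most an 8972-byte payload, while a 1500-byte endpoint tops out at 1472. A rough sketch (the helper names are illustrative, not from any VMware tool):

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
ICMP_HEADER = 8    # bytes, ICMP echo header

def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented frame."""
    return mtu - IPV4_HEADER - ICMP_HEADER

def path_is_jumbo_clean(mtus, frame_size=9000):
    """True only if every hop in the path can carry a full jumbo frame."""
    return all(mtu >= frame_size for mtu in mtus)
```

In practice you would verify this end to end with something like `vmkping -s 8972 <target>` from the host; exact option support varies by ESX version, so check your local documentation first.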

MrBungle256
Contributor

Hi Aaron,

I am having exactly the same issue as you with one of our hosts.

We are running an Infortrend iSCSI SAN with QLA4052c cards in Dell 2950s. The hosts are connected to the SAN via unmanaged gigabit switches.

I was suspecting that it might be to do with patch levels on the host. I tried building the server with 3.5U3 and patching all the way to just before 3.5U4. The problem persisted though.

Unlike your problem, mine only occurs on one host.

It would seem the one common factor is the HBA....

I am going to have a chat with Q-Logic.

CUNetadmin
Contributor

aaron,

We experienced an event very similar to yours just last week.

We have about 25 VMs on a three host cluster of IBM x3650s all using QLogic dual-port 4062 iSCSI HBAs connected to a NetApp 2050 via a pair of Cisco 2960Gs. We've been running U2 on all three hosts for over 100 days. Out of the blue around 5:30am last Tuesday the oldest of the three drops from vCenter, three VMs go dead, and another VM on the same host records "Disk Controller failure" in the event log for about 40 seconds. Looking back in the logs, the same VM records hundreds of NTFS errors three days earlier. The other two hosts connected to the same LUNs don't see anything. The vmkwarning log from the host that failed looks just like yours.

We've had a support case open with VMware all week with no resolution so far. I've emailed the SE with a link to this discussion.

Thank you for posting this issue.

aaronb123
Contributor

Our situations are similar.

Were your backups running at that time? VCB backups?

I am glad to say that since setting the MTU to 9000 (jumbo frames) on the ESX hosts to match the iSCSI switches, as well as balancing my VMs to about 8 per 500 GB LUN (to minimize SCSI reservation conflicts), I have not had the storage disconnects.

NB: There is no mention of jumbo frames in the iSCSI best practices documents that I could find.

A wise instructor told me many, many years ago (I'm getting old...) to "always start from the ground up." In this case, it was a networking issue. Just take a few minutes and double check all your networking settings: MTU, duplex, etc. You never know, it might be related to your issue.
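That ground-up check can be turned into a trivial audit: record the MTU each device in the storage path reports and flag whatever disagrees with the rest. A sketch with a made-up inventory (device names and values are hypothetical):

```python
from collections import Counter

def find_mtu_mismatches(inventory):
    """Given {device_name: mtu}, return the devices whose MTU differs
    from the most common value in the storage path."""
    common_mtu, _ = Counter(inventory.values()).most_common(1)[0]
    return {dev: mtu for dev, mtu in inventory.items() if mtu != common_mtu}

# Hypothetical storage path: two switches at jumbo frames, one HBA left at 1500.
inventory = {
    "iscsi-switch-a": 9000,
    "iscsi-switch-b": 9000,
    "esx1-qla4052c": 1500,   # this kind of odd one out is what to look for
}
```

The same idea extends to duplex settings or any other per-port parameter you can collect.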

Hope that helps.

Aaron B

CUNetadmin
Contributor

Aaron,

Thanks for the reply.

I’ve been the only person to ever work on our SAN and I’ve never changed the MTU sizes since Jumbo frames are not “officially” supported for iSCSI in ESX 3.5.

Having said that, I have been checking the adapters and switches and so far everything appears normal.

None of the switch interfaces have recorded any runts, giants, bad frames, or other input errors.

dwilliam62
Enthusiast

FYI: Jumbo frames with the ESX 3.5 SW iSCSI initiator aren't officially supported by VMware. However, when using the QLogic 40xx iSCSI HBA, they are supported: the HBA handles the iSCSI/network traffic on its own, outside of the VMware network stack.

-don
