chriswizemen
Contributor

ESXi 4.1, MD3000i, Path Redundancy failures

I inherited a vSphere 4.1 environment attached to two MD3000i arrays. One of the MD3000i's functions flawlessly, and both did before I made changes to the second.

Last month, I decommissioned the 10 LUNs served by the second SAN. I created 5 new LUNs, and served them to my cluster.

The datastore shows up, has paths over each of the SAN vmks, and looks fine.

This month, I created new guests living on the datastore. Whenever there is any sort of IO from a guest on this datastore, I get an email like the following.

([Event alarm expression: Lost Storage Connectivity] OR [Event alarm expression: Lost Storage Path Redundancy] OR [Event alarm expression: Degraded Storage Path Redundancy])

Occasionally I also receive:

Issue detected on esx03 in datacenter: ScsiDeviceIO: 2368:Failed write command to write-quiesced partition naa.60024e800070282800004dd04f7c6bf0:1

(34:06:21:27.996 cpu15:4438724)

The host logs:

Jun  7 07:26:43 esx03 vmkernel: 34:06:21:27.996 cpu15:4438724)Fil3: 1035:  Sync WRITE error ('') (ioFlags: 16) : IO was aborted
Jun  7 07:26:43 esx03 vmkernel: 34:06:21:27.997 cpu4:6566)Fil3: 1035:  Sync READ error ('.fbb.sf') (ioFlags: 8) : IO was aborted by VMFS via a virt-reset on the device
Jun  7 07:26:43 esx03 iscsid: Kernel reported iSCSI connection 9:0 error (1017) state (3)

Jun  7 07:26:43 esx03 vmkernel: 34:06:21:28.717 cpu7:4103)ScsiDeviceIO: 1688: Command 0x2a to device "naa.60024e800070282800004dd54f7c6cbe" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Jun  7 07:26:43 esx03 vmkernel: 34:06:21:28.717 cpu7:4103)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027efd8340) to NMP device "naa.60024e800070282800004dd54f7c6cbe" failed on physical path "vmhba33:C0:T1:L45" H:0x2 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1.

(plenty more of the same, for multiple LUNs)

After receiving that error, I can see visible path failures in the VIC. Within about 4 minutes those errors clear up and all the pathing registers as healthy. The guests running on that datastore don't seem to notice any blips or problems.
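For anyone who wants to watch the same thing from the host side instead of the VIC, the path state can also be checked from the Tech Support Mode shell. The device name below is one of mine from the logs above; substitute your own, and note I'm quoting the 4.1 command forms from memory:

esxcfg-mpath -b -d naa.60024e800070282800004dd54f7c6cbe    # brief listing of the paths and their state for one device
esxcli nmp device list --device naa.60024e800070282800004dd54f7c6cbe    # shows the PSP in use and the current working path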

VMware support said to talk to my storage vendor.

Dell support diagnosed a failed controller [0,1], and replaced it.

The problems continue.

I have walked through the Dell MD3000i setup guide for iSCSI with ESXi to verify my configuration.
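The piece of that guide I re-checked most carefully was the software iSCSI port binding. For reference, on 4.1 the bindings can be verified from the shell roughly like this (vmhba33 is my software iSCSI adapter, and vmk1/vmk2 are my iSCSI vmkernel ports; substitute your own names):

esxcli swiscsi nic list -d vmhba33    # shows which vmkernel ports are bound to the software iSCSI adapter
esxcli swiscsi nic add -n vmk1 -d vmhba33    # binds vmk1 if it is missing
esxcli swiscsi nic add -n vmk2 -d vmhba33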

I have continuous vmkpings between all hosts and the SAN controllers.
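By "continuous vmkpings" I mean something like the following running from each host against every controller iSCSI port (the addresses here are just placeholders for my iSCSI subnet):

vmkping 192.168.130.101    # controller 0, iSCSI port 0
vmkping 192.168.130.102    # controller 1, iSCSI port 0
vmkping -s 1472 192.168.130.101    # near-MTU-sized packet to catch fragmentation problems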

I have the current firmware for the SAN controllers.

I have the current 4.1 patches installed on all hosts.

I have switched hosts, switched paths, tried both the MRU and RR path policies, and the errors continue.
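For the MRU/RR switching, I changed the policy per device both in the VIC and from the shell, roughly like this (the device name is one of mine from the logs above; VMW_PSP_MRU is the other policy I toggled back to):

esxcli nmp device setpolicy --device naa.60024e800070282800004dd54f7c6cbe --psp VMW_PSP_RR
esxcli nmp device list --device naa.60024e800070282800004dd54f7c6cbe    # confirm the active PSP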

I'm hoping for community suggestions, or maybe other Dell MD3000i customers who have run into similar issues can tell me what I'm missing.

Thanks.

1 Reply

chriswizemen
Contributor

I continue to see people reading this post, so I thought it best to follow up.

I have purchased a NetApp and resolved my open ticket with VMware.

Working with VMware and Dell, the following diagnostic steps were performed:

Changing switch configurations.

Changing switch firmware.

Changing VMware versions.

Changing host mapping.

Changing firmware on the MD3000.

Changing controllers on the MD3000.

Changing cables to the attached MD1000.

Changing pathing from MRU to RR.

Deleting and recreating the datastores.

Deleting dead paths. Forcing path failure (see the sketch after this list). Enabling/disabling Storage IO Control.

* There are probably more steps, but after 4 months of diagnosis, and working on multiple issues, I forget them all. Hopefully these notes help you anyways.
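For what it's worth, the "forcing path failure" step above was done from the host shell roughly as follows. The path name is one of mine from the logs in the original post, and I'm quoting the 4.x esxcfg-mpath syntax from memory, so treat this as a sketch:

esxcfg-mpath --state off --path vmhba33:C0:T1:L45    # take one path offline to force a failover
esxcfg-mpath --state active --path vmhba33:C0:T1:L45    # bring it back
esxcfg-mpath -b    # confirm the path states afterwards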

I can say definitively that none of them made any significant change to the behavior, which was 14-25 "Lost Storage Path Redundancy" emails a day from vSphere.

ESXi 4.1 U3 changes the iSCSI timeout to reduce the frequency of this error.
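I never found a supported way to tune that timeout on 4.1 myself. On 5.x hosts the iSCSI login timeout is at least visible (and, I believe, adjustable from 5.1 onward) through esxcli; vmhba33 is my software iSCSI adapter name and yours will differ, so take this as a pointer rather than a recipe:

esxcli iscsi adapter param get --adapter=vmhba33    # look for the LoginTimeout value
esxcli iscsi adapter param set --adapter=vmhba33 --key=LoginTimeout --value=60    # 5.1+ only, from what I can tell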

The other change, which I believe would significantly impact the performance/errors I am seeing, would be to revert from RAID 6 to RAID 5 on the storage backend. Unfortunately, with production data in place, I am unable to make that change at this time. Dell strongly recommends running RAID 5 and not RAID 6, and strongly recommends not building datastores across multiple aggregates (stick to volumes under 2 TB?!).

Dell support insists that the MD3000 with 4.1 does NOT have the same timeout/pathing issues as the MD3200/3600 with VMware 5. My experience, and the log analysis done by VMware, indicate that exactly the same timing issue occurs as documented. Either way, Dell does not have any updates that result in a significant change in the behavior.

One other important fact to share: I have two nearly identical MD3000i/MD1000 SANs. Only one of them causes Path Redundancy warnings. Both receive "External IO Load" warnings, and both have path failures, but only one experiences the 30-second LUN disconnects that trigger the email notifications.

Anyway, thanks. If anyone has any questions, feel free to follow up with me; I'm happy to answer more. To the general population glancing at this issue: hopefully this spares someone some of the stress I have experienced.
