VMware Cloud Community
erpomik
Contributor

Warning: Crashed iSCSI MPIO LUNs on Synology after upgrade to vSphere 6

Hi

This is meant as a warning to other VMware users before they run into the same kind of problems that we did when upgrading to vSphere 6.

We have been running a setup like the one you see in the picture below for a long time, without any problems. The setup is built on three Dell R620 servers and two Synology RackStations - one RS3412RPxs and one RS3614RPxs.

[Attached image: H5 Hal8 iSCSI MPIO.png - iSCSI MPIO topology diagram]

But after we upgraded from vSphere 5.5U2 to vSphere 6.0, our SANs and LUNs started crashing. We even had two disks die during the 14 days we have been fighting the problem. Both Synos had high CPU usage and high memory load, and very often did not respond on either the web interface or SSH. Twice during the past 14 days, the LUNs crashed so hard that we had to perform a disaster recovery.

Just this night, we finally found the root cause: Path-A and Path-B were placed in the same broadcast domain (the same IP subnet). Even though I know this is not best practice, it had never caused problems when running vSphere 5.0 and 5.5.

I'm sorry, but I need to correct the statement above, because our Synology LUNs have crashed again, even though the two iSCSI paths were separated into two VLANs.
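For reference, separating the two paths looked roughly like this from the ESXi shell (a sketch only - the port group names, vmk/vmhba names, VLAN IDs and addresses below are placeholders, not our exact values):

    # Tag each iSCSI port group with its own VLAN
    esxcli network vswitch standard portgroup set -p iSCSI-A -v 10
    esxcli network vswitch standard portgroup set -p iSCSI-B -v 20

    # Path-A: vmkernel port in subnet 10.0.10.0/24
    esxcli network ip interface ipv4 set -i vmk1 -I 10.0.10.11 -N 255.255.255.0 -t static
    # Path-B: vmkernel port in subnet 10.0.20.0/24
    esxcli network ip interface ipv4 set -i vmk2 -I 10.0.20.11 -N 255.255.255.0 -t static

    # Bind both vmkernel ports to the software iSCSI adapter for MPIO
    esxcli iscsi networkportal add -A vmhba33 -n vmk1
    esxcli iscsi networkportal add -A vmhba33 -n vmk2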

What we see is this:

When the ESXi host starts up, we see two paths to each LUN on the Syno, with LUN IDs of 0, 1, 2, etc. But after a while, the host has quadrupled the paths to each LUN, with LUN IDs like 0/256/512/768, 1/257/513/769, etc. And when this happens, all the trouble starts. We now have five dead disks (three HDDs and two SSDs) and one dead RS3614RPxs as a result of this problem. I don't know how it's possible for the Syno to destroy a disk in this situation, but this is what actually happens. And on one of our RS3614RPxs, even the internal flash card has crashed!
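A quick way to spot these ghost paths is to list the paths and their LUN IDs from the ESXi shell (the naa device ID below is just a placeholder for one of your datastore devices):

    # Show the LUN ID of every path - ghost paths show up as 256/512/768 for what should be LUN 0
    esxcli storage core path list | grep "LUN:"

    # List all paths for a single device; with two physical paths, anything above 2 is suspicious
    esxcli storage core path list -d naa.xxxxxxxxxxxxxxxx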

One theory we have is that ESXi 6.0 has a more aggressive iSCSI policy: if the Syno is not responding fast enough, ESXi tries to create a new "ghost" path with LUN ID n + 256, and so on.

Over the weekend we downgraded the three hosts to ESXi 5.5U2, and everything is running stably again. No "ghost" LUNs get created at any time.

So I am NOT saying that this is a bug in either vSphere or DSM, only that vSphere 6 seems to be incompatible with DSM 5.2 Update 2.

Update 2015-06-25:

While I was still wondering what could possibly be the reason for this problem, I took a look at the Configuration Maximums for vSphere 5.5 and vSphere 6.0. And guess what? The maximum LUN ID has been raised from 255 to 1023 (8 bits vs. 10 bits, respectively). My guess now is that Synology DSM does not support LUN IDs higher than 255.
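If that guess is right, one possible mitigation (untested by us, so treat it purely as an assumption) could be to cap the LUN IDs that ESXi scans via the Disk.MaxLUN advanced setting, so the host never probes IDs above 255:

    # Check the current upper bound for LUN scanning
    esxcli system settings advanced list -o /Disk/MaxLUN

    # Limit scanning to LUN IDs 0-255 (the value is the number of LUNs scanned per target)
    esxcli system settings advanced set -o /Disk/MaxLUN -i 256

    # Rescan so the change takes effect
    esxcli storage core adapter rescan --all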

Sorry if this is not the right place for such a warning. I just had an urge to write about it. Please feel free to share this information on any media 🙂

The best of luck,

Ernst Mikkelsen (VCP5)

Trifork A/S
