VMware Cloud Community
rb78
Contributor

iSCSI on QNAP with Multipath results in data corruption on file system

Dear all,

 

We're seeing strange behavior with our storage configuration on ESXi 6.7 U3 (Build 17713310).

We have a QNAP 1677XP 16-bay multi-tier storage system, configured as follows:

  • It contains two LUNs on one storage pool.
  • The storage is attached via an LACP trunk to a 10 GBit switch.
  • The storage is accessible via IP 10.1.1.10/16 (a single IP because of the trunk).

We have two Dell ESXi servers (both running the 6.7 Dell image, build 17167734), configured identically:

Network config:

  • Jumbo frames are enabled on the ESXi servers and on the QNAP.
  • All network connections that have anything to do with the storage are on one port-based VLAN.
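As a sanity check for a setup like this, jumbo frames can be verified end to end from each ESXi host. A rough sketch (the vmkernel interface name `vmk1` is a placeholder, not taken from the setup above):

```shell
# Verify jumbo frames end to end: -d forbids fragmentation, -s 8972 is
# a 9000-byte MTU minus 28 bytes of IP/ICMP headers.
# vmk1 is a placeholder for the iSCSI vmkernel interface.
vmkping -I vmk1 -d -s 8972 10.1.1.10

# Confirm the MTU actually configured on the vmkernel interfaces:
esxcli network ip interface list
```

If the oversized ping fails while a normal `vmkping` succeeds, some device in the path is not passing jumbo frames.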

In the attachment you can find an image of the setup.

 

The storage contains two LUNs of 22.5 TB each. The first LUN is in use and contains VMs. The second LUN is empty and currently used for migrations/updates.

 

So the problem is: when I copy VM files or migrate VMs from LUN0 to LUN1 or vice versa, the file system on the destination gets corrupted. Every time!

I can reproduce it every time in this configuration. I deleted the LUN1 datastore in ESXi and created it again, then ran a file system check -> file system is OK -> I copy 300 GB -> file system is corrupt.
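For reference, a VMFS metadata check like the one described can be run from the ESXi shell with VOMA. A sketch (the `naa` device ID is a placeholder, and the datastore should have no running VMs while checking):

```shell
# Find the device backing the datastore (returns the naa.* device ID):
esxcli storage vmfs extent list

# Check VMFS metadata on partition 1 of that device.
# naa.xxxxxxxxxxxxxxxx is a placeholder for the real device ID.
voma -m vmfs -f check -d /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx:1
```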

 

What I checked:

  1. All HDDs on the QNAP (long test and SMART info, twice) -> all OK
  2. The logs of the switch (HP 6600ml) -> no Tx/Rx errors -> no port errors

 

What I tried this week:

  1. I enabled iSCSI header and data digests and did a test run (= I recreated the datastore => file system is OK => I copied 300 GB => file system is corrupt).
  2. I disabled our load-balancing paths so that only one 10 GBit path from each host was connected to the storage and did a test run (= I recreated the datastore => file system is OK => I copied 1.5 TB => file system is NOT corrupt). I also tried disabling the other two paths instead, and that works as well. See attachment.
  3. I left all paths active, set the multipath policy to "Fixed" with one path per destination, and did a test run (= I recreated the datastore => file system is OK => I copied 300 GB => file system is corrupt).
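In case it helps anyone retracing these steps, the ESXi shell equivalents might look roughly like this (the adapter, device, and path names are placeholders, not taken from the setup above):

```shell
# 1. Require iSCSI header and data digests on the software iSCSI adapter
#    (vmhba64 is a placeholder adapter name):
esxcli iscsi adapter param set -A vmhba64 -k HeaderDigest -v required
esxcli iscsi adapter param set -A vmhba64 -k DataDigest -v required

# 2. Disable an individual path so that only one path stays active
#    (the path name is a placeholder):
esxcli storage core path set --state off -p vmhba64:C0:T1:L1

# 3. Switch the device to the Fixed path selection policy and pin one path:
esxcli storage nmp device set -d naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_FIXED
esxcli storage nmp psp fixed deviceconfig set -d naa.xxxxxxxxxxxxxxxx \
    -p vmhba64:C0:T0:L1
```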

So the problem occurs when more than one path to the iSCSI target is active. After all my checks and tests I think the hardware is OK. Now I'm really perplexed and don't know what is wrong or how to solve it - maybe you have a tip for me.

 

Best regards

 

Rainer

 

 

3 Replies
davidemiccone
Contributor

Hello,

Have you found the origin of the problem?

I need to set up a similar configuration and would like to know whether the problem has been resolved.

rb78
Contributor

Hey,

 

the problem still exists. We did several things over the last few months:

  • Updated the QNAP firmware
  • Changed all network cables
  • Changed all transceivers
  • Changed all 10 GBit network cards
  • Removed the trunk from the switch to the QNAP NAS
  • Removed multipathing in ESXi
  • The last two points mean there is only one network line from each ESXi host to the switch and one line from the switch to the QNAP. It's the simplest possible configuration.

 

There are no errors on the QNAP, on the RAIDs, or on the hard disks (I ran long-term tests), and no errors on the switch (HP 6600).

 

I have two LUNs on the storage, so I can migrate the data to a LUN where the file system is OK. It runs for about two weeks and then crashes; the last time with around 400 errors.

 

I know that this problem has nothing to do with the QNAP SSD cache (which we deactivated).

 

We have a second QNAP storage with 4x 14 TB HDDs in a RAID-5 configuration for backing up our VMs. This NAS is attached to both ESXi hosts via iSCSI too (only one network cable, no trunking, no multipathing, etc.). There is no multi-tiering on that storage. We copied around 90 TB of data over the last weeks and we don't have any problems so far. OK, this storage is only used for backups two days a week, but it uses the same file system format (VMFS-6).

 

I can imagine that the problem might be the multi-tiering, but I cannot say definitely that it is the root cause.

 

Last Monday we removed our second LUN and created an NFS datastore instead (NFS 3.0 with the VAAI plugin - NFS 4.1 with the VAAI plugin doesn't run successfully; plugin version 3.2-001 seems to be buggy). I hope this solves the problem, because the file system handling will then be done by the QNAP directly.
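For anyone following the same route, mounting an NFS 3 export as a datastore from the ESXi shell looks roughly like this (the host IP matches the setup above, but the export path and volume name are placeholders):

```shell
# Mount an NFS 3 export as a datastore (share and name are placeholders):
esxcli storage nfs add --host 10.1.1.10 --share /QNAP-NFS --volume-name qnap-nfs

# Verify the mount:
esxcli storage nfs list
```

The NFS 4.1 namespace would be `esxcli storage nfs41` instead, which is the variant that did not work reliably with the VAAI plugin here.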

 

I created tickets at VMware and at QNAP, but I don't think they can help me. In my opinion our hardware is OK (including the hardware we replaced).

 

Best regards


Rainer

davidemiccone
Contributor

Thank you for your long reply.

It is very disappointing that QNAP can't help you 😞
