VMware Cloud Community
vanree
Enthusiast

Cluster losing access to datastores

Hi,

We have had a very strange problem for a while now, and recently it has become more serious.

We have two IBM System x servers, both now running ESX 4.1, and a QNAP NAS with 16 TB of storage in RAID 6.

Before we upgraded the two servers from 4.0 to 4.1, we already had the problem that after a reboot we had to re-add storage and reconnect all five datastores (iSCSI, each on its own target and LUN, 1.9 TB each) on each ESX server separately (we could keep the same signature). After that everything would run fine (HA and DRS both on, and EVC mode on Intel Xeon Core 2). We could move VMs between datastores and between hosts smoothly.
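
In case the details matter: after each reboot we would rescan from the service console and then re-add each datastore through the Add Storage wizard, keeping the existing signature. The rescan part, for reference (vmhba33 is the software iSCSI adapter on our boxes, as in the log further down):

esxcfg-rescan vmhba33   # rescan the software iSCSI adapter for targets/LUNs
vmkfstools -V           # refresh VMFS volumes so rediscovered datastores reappear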

After the upgrade of the first server (ESX01) to version 4.1, we now have the problem that ESX02 cannot connect to the five iSCSI datastores anymore. The paths show up, but adding the datastore does not work: it wants to format the disk (the other two options are greyed out). This is obviously not a good idea, since many of our VMs are on these datastores. We also noticed on the NAS that ESX01 has one connection to the NAS, but ESX02 has three connections going.
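
The "format the disk" behaviour looks like the classic snapshot-LUN detection, so this may be worth checking on ESX02. A sketch of the service-console check: esxcfg-volume lists volumes ESX has flagged as snapshots/replicas, and -M mounts one persistently without resignaturing (the UUID/label placeholder comes from the -l output):

esxcfg-volume -l                     # list VMFS volumes detected as snapshots/replicas
esxcfg-volume -M <VMFS-UUID|label>   # mount one persistently, keeping the existing signature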

We did some more testing and could reverse the connection issue: after creating a datastore on ESX02 (one connection to the NAS) and rescanning on ESX01, the datastore is available there too, but ESX01 then has three connections to the NAS.

The hardware infrastructure might be part of the problem. Between the NAS and the two ESX boxes we have a Linksys SRW2016 (16-port Gigabit switch). Each of these three devices has two teamed (hash) NICs connecting to the switch, and the corresponding switch ports have been configured as LACP pairs. The communication seems to work fine; we have never had data corruption so far.
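
One thing I am not sure about: as far as I know, ESX 4.x standard vSwitches do not negotiate LACP at all; they only support static link aggregation together with the "Route based on IP hash" teaming policy, so the switch side may need to be a static trunk rather than dynamic LACP. The uplink layout can at least be checked from the console:

esxcfg-vswitch -l   # list vSwitches with their uplinks and port groups
esxcfg-nics -l      # list physical NICs with link state and speed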

When I try to reconnect the datastore on our ESX02 server, this is what I find in the messages log:

Sep 14 03:11:02 ESX02 vobd: Sep 14 03:11:02.965: 255076495735us: [vob.scsi.scsipath.add] Add path: vmhba33:C0:T5:L0.
Sep 14 03:11:02 ESX02 vobd: Sep 14 03:11:02.966: 255076496687us: [vob.scsi.scsipath.add] Add path: vmhba33:C0:T6:L0.
Sep 14 03:11:03 ESX02 kernel: [254798.743109] Vendor: QNAP  Model: iSCSI Storage  Rev: 3.1
Sep 14 03:11:03 ESX02 kernel: [254798.750489] Type: Direct-Access  ANSI SCSI revision: 05
Sep 14 03:11:03 ESX02 kernel: [254798.816357] SCSI device sdc: 209715201 512-byte hdwr sectors (107374 MB)
Sep 14 03:11:03 ESX02 kernel: [254798.823157] sdc: Write Protect is off
Sep 14 03:11:03 ESX02 kernel: [254798.830611] SCSI device sdc: drive cache: write through
Sep 14 03:11:03 ESX02 kernel: [254798.837828] SCSI device sdc: 209715201 512-byte hdwr sectors (107374 MB)
Sep 14 03:11:03 ESX02 kernel: [254798.844354] sdc: Write Protect is off
Sep 14 03:11:03 ESX02 kernel: [254798.851547] SCSI device sdc: drive cache: write through
Sep 14 03:11:03 ESX02 kernel: [254798.857780] sdc: sdc1
Sep 14 03:11:03 ESX02 kernel: [254798.858189] sd 4:0:4:0: Attached scsi disk sdc
Sep 14 03:11:03 ESX02 kernel: [254798.864430] sd 4:0:4:0: Attached scsi generic sg3 type 0
Sep 14 03:11:03 ESX02 kernel: [254798.870761] Vendor: QNAP  Model: iSCSI Storage  Rev: 3.1
Sep 14 03:11:03 ESX02 kernel: [254798.877105] Type: Direct-Access  ANSI SCSI revision: 05
Sep 14 03:11:03 ESX02 kernel: [254798.890493] SCSI device sdi: 209715201 512-byte hdwr sectors (107374 MB)
Sep 14 03:11:03 ESX02 kernel: [254798.897087] sdi: Write Protect is off
Sep 14 03:11:03 ESX02 kernel: [254798.903969] SCSI device sdi: drive cache: write through
Sep 14 03:11:03 ESX02 kernel: [254798.910657] SCSI device sdi: 209715201 512-byte hdwr sectors (107374 MB)
Sep 14 03:11:03 ESX02 kernel: [254798.917180] sdi: Write Protect is off
Sep 14 03:11:03 ESX02 kernel: [254798.924377] SCSI device sdi: drive cache: write through
Sep 14 03:11:03 ESX02 kernel: [254798.930618] sdi: unknown partition table
Sep 14 03:11:03 ESX02 kernel: [254798.930965] sd 4:0:10:0: Attached scsi disk sdi
Sep 14 03:11:03 ESX02 kernel: [254798.937243] sd 4:0:10:0: Attached scsi generic sg9 type 0

I guess the "sdi: unknown partition table" line is part of the problem, and again, this same datastore is in full operation on the ESX01 box.
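
The partition table on the broken host can be inspected read-only with fdisk -l, which writes nothing; on the working host the same LUN should show a partition of type fb (VMFS):

fdisk -l /dev/sdi   # print the partition table as this host currently sees it; read-only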

My question is: what are we doing wrong, why are we not able to connect to the datastore from one box, and why would these ESX servers have three connections for one datastore?

Thanks, Edwin

6 Replies
marcelo_soares
Champion

I think that in this particular case you will need to rebuild the partition table for this datastore. Use the information in http://kb.vmware.com/kb/1002281

Basically:

fdisk /dev/sdi

(recreate the VMFS partition following the steps in the KB article)

w

At any point you can type q to exit without saving and start the process again.
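
From memory, the full sequence in the KB looks roughly like this; verify every step against the KB (and against a known-good LUN) before writing anything, since a wrong table will destroy the datastore:

fdisk /dev/sdi
n     # new partition: primary, number 1, accept the default first/last cylinder
t     # change the partition type
fb    # fb = VMware VMFS
x     # expert mode
b     # move the beginning of data in a partition...
1     # ...partition 1...
128   # ...to sector 128, where VMFS3 partitions start
w     # write the table and exit (q at any point aborts without saving)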

Marcelo Soares

VMWare Certified Professional 310/410

Virtualization Tech Master

Globant Argentina

Consider awarding points for "helpful" and/or "correct" answers.

vanree
Enthusiast

Hi Marcelo,

Thanks for your quick response. The problem is getting weirder. Last night here in South Australia I was looking at your solution and getting console access to the freshly installed ESX02 box, so I could do some investigating first. I wanted to compare the info with what ESX01 was reporting, so I also accessed ESX01 via PuTTY. The minute I did that, ESX01 crashed with a purple screen of death. Wow. I had never seen an ESX box crash before, especially not with all my production VMs on it. Now the interesting part: the ESX02 box could suddenly access all the previously inaccessible LUNs! So I moved all the VMs over to the ESX02 box and everything was soon running again.

This morning I did the same firmware and RAID controller upgrades on the ESX01 IBM server and did a fresh ESX 4.1 install. I double-checked all configuration and then tried to activate iSCSI and find the LUNs. Now this ESX01 box has exactly the same problem the other box had before: the LUNs cannot be mounted (it wants to format the disk).

What I can conclude is that the problem is not located in one ESX installation. So I removed the NIC teaming and tried connecting the LUNs again; it did not make a difference. Again, the ESX01 box shows up on the NAS with three connections and ESX02 with only one connection. I reconnected the second NICs in all three boxes (NAS, ESX01 and ESX02) again. My conclusion now is that maybe the NAS is doing something wrong. There is a firmware upgrade, which I will try next, but I have to safeguard around 5 TB of data before I can do this, and move some of our production VMs to the local IBM RAID storage first.
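
For anyone comparing hosts: you can list the storage paths each ESX box actually has from the service console and set that against the session count the NAS shows:

esxcfg-mpath -b   # brief listing, one line per path
esxcfg-mpath -l   # detailed listing, including the iSCSI target behind each path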

The NAS is a QNAP TS-809U-RP Turbo NAS, VMware approved. Current firmware version: 3.2.3 Build 0209T. There are newer versions with some iSCSI fixes, so that is my next best hope!

Thanks again for your help and if you have more ideas, let me know. I will post the result later.

Cheers, Edwin

marcelo_soares
Champion

Hmmmm.

Seems that QNAP is not presenting the LUNs in the right way. Tell me what happens after the firmware upgrade.

Also, tell me a bit about how you connect from the ESX boxes to the NAS. Do you specify different IPs in each ESX configuration, or do they share the same targets?

Marcelo Soares

VMWare Certified Professional 310/410

Virtualization Tech Master

Globant Argentina

Consider awarding points for "helpful" and/or "correct" answers.

vanree
Enthusiast

I think we found the cause of the problem.

Yesterday I upgraded the firmware of the QNAP NAS without any trouble. Still the same problem though: one ESX box has two connections and the other has three connections to each LUN target. After a bit of research on the net I found other people having similar issues.

The multiple connections are now solved by removing the Advanced ACL LUN masking. We had two policies, one for each ESX box, allowing them to access the LUNs, and we had made the default policy block access to the LUNs. I have now changed the default policy back to allow all and removed the two per-server policies. So something is not handled correctly when these policies are used.

On the ESX front, the five original LUNs still only connect to ESX02 at the moment, but I could create new LUNs which are accessible from both boxes, and I am now moving all datastores over to these new LUNs (lucky we have some space left on our NAS). So far this looks good, but the final proof is when we reboot both ESX boxes. When I created these new LUNs I rebooted ESX01, and it was very happy; it found the new LUNs automatically this time.

I also noticed, while moving the VMs to other datastores from ESX02 (which has access to all LUNs), that when we select a datastore the new LUNs show "Multiple hosts" under the Access header, while the "broken" LUNs show "Single host". I am not sure whether this is simply a result of those LUNs not being mounted by multiple hosts, or whether there is a setting on the volume itself saying it cannot be used by multiple hosts.
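
If anyone wants to check the same thing from the console instead of the vSphere Client, the volume attributes can be queried; the datastore name below is a placeholder:

vmkfstools -P /vmfs/volumes/<datastore-name>   # query VMFS attributes (version, capacity, mode); read-only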

Thanks again for taking the time to think with me!

Cheers, Edwin

vanree
Enthusiast

This week we had the tech guys from QNAP look at our system, because ESX02 still loses a LUN, especially during the move of large VMs.

The QNAP has been updated to the latest firmware without a problem, and we found a logging feature for all iSCSI events, like connect and disconnect, which revealed an interesting issue. Whenever ESX02 loses a LUN, it starts to connect using the username "System" instead of the prescribed full iSCSI qualifier. When we change some of the settings in the iSCSI connection on ESX and do a rescan, most of the time the LUN comes back online, because ESX then uses the correct full qualifier instead of "System".
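
To see what initiator name a host presents, the software iSCSI adapter can be queried from the service console. This is a sketch: vmhba33 is assumed to be the software iSCSI HBA, and I am going from memory on the flags, so check the tool's help output first:

vmkiscsi-tool -I -l vmhba33   # show the iSCSI node (initiator) name the adapter presents
vmkiscsi-tool -A -l vmhba33   # show the authentication (CHAP) settings on the adapter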

We are still in the process of moving all our data to the LUNs that work on both ESX systems. When that is finished tomorrow, I will remove the broken LUNs, and then we have to wait and see if the connection problem completely disappears; otherwise we will have to contact VMware support, because at this stage it looks like a bug in ESX.

Cheers, Edwin

vanree
Enthusiast

Thought I'd give you a last update on our now-solved problem.

We have now been running for two weeks since I managed to move all datastores over to the newly created ones. Not without hiccups though: every now and then the ESX02 server would lose its connection to one of the datastores and try to log in to the NAS as user "System". Changing something in the iSCSI settings and then scanning for changes would bring the datastore back online.

The two ESX servers have been running OK since then with the new datastores, so the previously created datastores must have had some VMware-inflicted corruption.

Cheers, Edwin
