VMware Cloud Community
aysabtu
Contributor

Problem w/IFT Eonstore box and ESXi4

I'm in the process of setting up an ESXi 4.0 installation w/a mini SAN including an EonStore A24F-G2430 FC-to-SATA RAID box. I'm running into 3 problems related to I/O to the EonStore.

Problem 1: Migrating a certain Linux VM w/a 20G virtual disk from a VMFS partition residing on an internal RAID partition (datastore1) onto the SAN reproducibly fails with an uninformative error msg, 'Error caused by file ...'. Trying to copy it over using Veeam SCP also fails.

Problem 2a: I installed a new 64-bit RHEL 5.4 VM directly onto the RAID box w/o problems. When I started to rsync data over to the new VM, it crashed (froze up) with console messages indicating problems with the virtual SCSI system disk. This is reproducible.

Problem 2b: After rebooting it, I tried to rsync the data that had actually been transferred (4.5G) within the VM. This causes exactly the same problems.

Both the server (Dell R710) and the EonStore box should be OK with ESXi4. Does anybody know what might be going wrong, or have experience w/ESX4 and EonStore boxes? I would love to look into the detailed vmkernel messages, but haven't been able to find out how to view them within the vSphere client.

-- Michael

Environment:

Storage box: Infortrend EonStore A24F-G2430 FC-to-SATA RAID box, firmware 3.64R.01

SAN switch: QLogic SANbox 5602 FC switch, ActiveSWVersion V6.7.0.4.0

Server: Dell R710, 2x4-core Nehalem 2.27 GHz, w/an ISP2432-based 4Gb Fibre Channel to PCI Express HBA

ESX release: ESXi 4.0, Build 164009, no patches

VMFS file system: 1.91T LUN, VMFS 3.33, block size 8M

Path selection: Fixed (VMware), array type: VMW_SATP_DEFAULT_AA

9 Replies
AndreTheGiant
Immortal

Problem1:

How do you migrate the Linux guest?

Is the VM really powered off?

Problem2a/b:

How do you run the rsync?

Is the VM really powered off?

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
aysabtu
Contributor

Problem 1: I right-click it in the vSphere client and select 'Migrate'. It is definitely powered off.

Problem 2a/b: I run rsync from the Linux command line (with the VM naturally being powered on) from the local virtual disk (residing on the EonStore box) to the same local virtual disk, and the machine freezes after a few minutes.

When I do the same operation in the original VM residing on internal disks in the R710, the machine does not freeze.

I found out later yesterday that the VM recovers after some 20 minutes, spewing out a lot of error messages indicating that the virtual SCSI controller had hung.

As I encounter problems both in the vSphere client and within Linux on the running VM, I tend to believe that the problems arise from the interaction between vmkernel and the RAID box.

- Michael

aysabtu
Contributor

Finally managed to get access to the VMkernel log.

It shows repeated occurrences of the following two lines:

Oct 1 07:30:25 vmkernel: 63:19:54:52.083 cpu13:14447844)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100060c4400) to NMP device "naa.600d0230ffffffff0286181f3a0c0800" failed on physical path "vmhba2:C0:T0:L0" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Oct 1 07:30:25 vmkernel: 63:19:54:52.083 cpu13:14447844)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.600d0230ffffffff0286181f3a0c0800" state in doubt; requested fast path state update...
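
For anyone decoding lines like these, here is a minimal Python sketch that pulls the SCSI opcode, device, path, and the H:/D:/P: status fields out of an nmp_CompleteCommandForPath line. The function name parse_nmp_failure and the small opcode table are illustrative choices, and the regex is an assumption based only on the line format quoted in this thread. Opcode 0x2a is WRITE(10) in the SCSI command set. VMware's KB appears to list host status H:0x2 as a bus-busy condition, but that should be checked against the KB before relying on it.

```python
import re

# Regex for the nmp_CompleteCommandForPath line format quoted in this thread.
# The verb before "on physical path" is matched loosely (\w+) so that small
# spelling differences in a pasted log do not break the match.
NMP_LINE = re.compile(
    r'Command (?P<opcode>0x[0-9a-fA-F]+) \(0x[0-9a-fA-F]+\) to NMP device '
    r'"(?P<device>[^"]+)" \w+ on physical path "(?P<path>[^"]+)" '
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<dev>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+)'
)

# Two standard SCSI opcodes; extend as needed.
SCSI_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def parse_nmp_failure(line):
    """Return the interesting fields of an NMP failure line, or None."""
    m = NMP_LINE.search(line)
    if m is None:
        return None
    opcode = int(m.group("opcode"), 16)
    return {
        "opcode": opcode,
        "opcode_name": SCSI_OPCODES.get(opcode, "unknown"),
        "device": m.group("device"),
        "path": m.group("path"),
        "host_status": int(m.group("host"), 16),
        "device_status": int(m.group("dev"), 16),
        "plugin_status": int(m.group("plugin"), 16),
    }


sample = (
    'Oct 1 07:30:25 vmkernel: 63:19:54:52.083 cpu13:14447844)NMP: '
    'nmp_CompleteCommandForPath: Command 0x2a (0x4100060c4400) to NMP device '
    '"naa.600d0230ffffffff0286181f3a0c0800" failed on physical path '
    '"vmhba2:C0:T0:L0" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.'
)

if __name__ == "__main__":
    print(parse_nmp_failure(sample))
```

A non-zero host status with device and plugin status both zero, as here, suggests the command failed before the LUN itself returned a SCSI status.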

I have no reason to expect a systematic hw failure in the EonStore box / SAN switch.

-- Michael

jguidroz
Hot Shot

Not sure if this is an issue, but the ISP2432-based 4Gb Fibre Channel to PCI Express HBA is currently not on the HCL for vSphere 4. I have 10 servers, a mix of 2950, R900, and R710, with the same adapter. Can you post which driver the ISP2432 HBA is using?

Thanks.

christianZ
Champion

Can you paste your ESX storage config page here?

Are you using LUN masking in your EonStor LUN mapping?

How many FC connections do you have between your storage and the FC switch?

How many FC HBAs do you have in your ESX host?

Have you tried installing a Windows VM?

Reg

Christian

aysabtu
Contributor

jguidroz wrote: "Can you post which driver the ISP2432 HBA is using?"

I think it uses the QLogic driver (qla2xxx):

~ # grep ISP /var/log/messages

Oct 4 18:47:46 vmkernel: 0:00:00:31.680 cpu10:4564)<6>qla2xxx 0000:04:00.0: Found an ISP2432, irq 193, iobase 0x0x4100b3e02000

Oct 4 18:47:48 vmkernel: ISP2432: PCIe (2.5Gb/s x4) @ 0000:04:00.0 hdma+, host#=4, fw=4.04.09 [Multi-ID

Oct 4 18:47:48 vmkernel: 0:00:00:34.174 cpu10:4564)<6>qla2xxx 0000:04:00.1: Found an ISP2432, irq 201, iobase 0x0x410009fbe000

Oct 4 18:48:19 vmkernel: ISP2432: PCIe (2.5Gb/s x4) @ 0000:04:00.1 hdma+, host#=5, fw=4.04.09 [Multi-ID

-- Michael

aysabtu
Contributor

I've now looked carefully into the configuration pages on the EonStore box and changed the following parameters:

Max Number of Concurrent Host-LUN Connection

Was default 4, changed to 64. This change was motivated by the fact that the box shares 4 LUNs on another FC controller with a bare-metal Linux server and currently 1 LUN with the ESX server. The change did not have any immediate effect, but I did not change it back.

Number of Tags Reserved for each Host-LUN Connection

Was default 4, changed to 32. 4 seemed low. The change did not have any immediate effect.

But then I rebooted the ESX server -- previously I had only put it into maintenance mode while rebooting the EonStore box. Now I can rsync from the Linux test VM local filesystem to local filesystem w/o problem. The test that reproducibly hung the VM has now succeeded 10 times without problems. This means that Problem 2 seems to be gone.

Tomorrow, I'll try to migrate the production VM from local disks to SAN (Problem 1) and see if it works too.

Though it would have been nice to see exactly which EonStore parameter change did the job, I can't allow for more service interruptions to the physical Linux server.

There's also the possibility that ESX had gone sick. I tend to believe that that's not the reason -- it seemed to behave nicely when I did not rsync data in the test VM.

-- Michael

asp24
Enthusiast

Any updates here?

I'm also seeing the same error messages: "state in doubt; requested fast path state update..." etc. I'm also using an Infortrend box (iSCSI/SAS).

Did the error messages go away? I see that I also have "Number of Tags Reserved for each Host-LUN Connection" set to 4. I will change this, but I have to schedule downtime first.

aysabtu
Contributor

asp24 wrote: "Any updates here? [...] Did the error messages go away?"

Things work fine now. I just grepped for the message in the VMkernel logs and didn't find any matching lines. I think the error message has been discussed elsewhere in the forums and probably indicates that VMkernel gets unexpected busy states from the RAID box. It's interesting that we see the problem with both iSCSI and FC. This points towards the problem being a generic issue with the IFT firmware and not bound to the FC medium or SATA disks.
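
A rough Python sketch of that kind of log check, for anyone who wants a count per device rather than a plain grep. The function name count_state_in_doubt is made up for illustration, and the regex assumes the "state in doubt" line format quoted earlier in this thread; /var/log/messages is the log file grepped earlier in the thread and may differ on other setups.

```python
import re
from collections import Counter

# Matches the nmp_DeviceRequestFastDeviceProbe warning quoted in this thread
# and captures the NMP device identifier.
STATE_IN_DOUBT = re.compile(r'NMP device "(?P<device>[^"]+)" state in doubt')


def count_state_in_doubt(lines):
    """Count 'state in doubt' warnings per NMP device."""
    counts = Counter()
    for line in lines:
        m = STATE_IN_DOUBT.search(line)
        if m:
            counts[m.group("device")] += 1
    return counts


sample_log = [
    'vmkernel: WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device '
    '"naa.600d0230ffffffff0286181f3a0c0800" state in doubt; '
    'requested fast path state update...',
    'vmkernel: qla2xxx 0000:04:00.0: Found an ISP2432',
    'vmkernel: WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device '
    '"naa.600d0230ffffffff0286181f3a0c0800" state in doubt; '
    'requested fast path state update...',
]

if __name__ == "__main__":
    print(dict(count_state_in_doubt(sample_log)))
    # To scan a real log instead (path taken from earlier in the thread):
    # with open("/var/log/messages", errors="replace") as f:
    #     print(dict(count_state_in_doubt(f)))
```

An empty result after the configuration change and reboot corresponds to the "no matching lines" outcome described above.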

Btw, the reason I chose to bump the value to 32 was that I found

http://docs.sun.com/source/817-3711-10/ch08_configparam.html#_Toc521744092

which suggests that Sun has OEM'ed the IFT hardware/firmware in the past. They state a default value of 32.

Do restart both ESX and the RAID box -- I was lazy at first and only put ESX into maintenance mode while rebooting the box, which did not cure the problem.

- Michael
