joergriether
Hot Shot

ESX 4 on Nehalem: vmkernel suddenly went down, all iSCSI gone, unresponsive in vSphere Center

Dear Group,

Yesterday we encountered something really bad. One of our Nehalem ESX 4 machines (latest patches) with a software iSCSI initiator (the target is EqualLogic) suddenly went totally offline on the vmkernel IP address. In addition, the machine became unresponsive in vSphere Center, although its main IP address was still pingable.

The vmkernel log shows some interesting information; take a look:

At 13:45 all was OK, but then suddenly at 14:42, when the handler "world 6303/2" was started (what is this handler???), the catastrophe began.

I had to hard-reset the ESX machine to get it online again.

Any ideas?

best,

Joerg

Aug 27 13:45:25 esx7 vmkernel: 6:22:51:50.681 cpu0:4111)FSS: 3647: No FS driver claimed device '4a16ef91-b7c8669d-ac80-002219ccd2a1': Not supported
Aug 27 14:42:50 esx7 vmkernel: 6:23:49:15.841 cpu5:6303)ScsiCore: 95: Starting taskmgmt handler world 6303/2
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:0 T:3 CN:0: iSCSI connection is being marked "OFFLINE"
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess ISID: 00023d000001 TARGET: iqn.2001-05.com.equallogic:0-8a0906-35e2f2304-bf3000000524a951-eql4-esx-lowpriovol3 TPGT: 1 TSIH: 0
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn CID: 0 L: 172.16.150.131:56447 R: 172.16.150.222:3260
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:0 T:1 CN:0: iSCSI connection is being marked "OFFLINE"
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess ISID: 00023d000001 TARGET: iqn.2001-05.com.equallogic:0-8a0906-4f32f2304-e21000000574a951-eql4-esx-lowpriovol1 TPGT: 1 TSIH: 0
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn CID: 0 L: 172.16.150.131:58041 R: 172.16.150.223:3260
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:1 L:0 : Task mgmt "Abort Task" with itt=0x133c23 (refITT=0x133c20) timed out.
Aug 27 14:42:54 esx7 vmkernel: 6:23:49:19.778 cpu3:5991)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.6090a04830f2324f51a97405000010e2" due to Not found
Aug 27 14:42:54 esx7 vmkernel: 6:23:49:19.778 cpu3:5991)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.6090a04830f2324f51a97405000010e2": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Aug 27 14:42:54 esx7 vmkernel: 6:23:49:19.778 cpu3:5991)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.6090a04830f2324f51a97405000010e2" is blocked. Not starting I/O from device.
Aug 27 14:42:55 esx7 vmkernel: 6:23:49:20.779 cpu7:4207)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6090a04830f2324f51a97405000010e2" - issuing command 0x410007148540
Aug 27 14:42:55 esx7 vmkernel: 6:23:49:20.779 cpu7:4207)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device

0 Kudos
17 Replies
chrisy
Enthusiast

I've seen something similar: a host that, when moved to a particular model of switch (for iSCSI), suffered kernel problems and became unmanageable. After a power cycle it was OK as long as iSCSI was not activated; at that point the whole server fell off the management network as well as iSCSI, and the service console became very sluggish and unresponsive. This was vSphere, fully patched.

Interestingly, increasing the frame size on the switch from 9000 to a slightly larger value seemed to fix it. I suspect that fixed the symptoms but not the root cause, whatever that is!
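On the ESX side, one quick way to check whether an MTU mismatch like this is in play is to probe the iSCSI path with full-size, unfragmented packets from the service console. A minimal sketch; the target IP is taken from the log snippet above, so adjust it for your environment:

```shell
# List the vmkernel NICs and their configured MTU
esxcfg-vmknic -l

# Send a jumbo-sized ICMP probe over the vmkernel network with the
# "don't fragment" bit set: 8972 bytes of payload + 28 bytes of
# IP/ICMP headers = 9000 bytes on the wire.
# If this fails while a small vmkping works, something in the path
# (NIC, vSwitch, physical switch, or target) is not passing jumbo frames.
vmkping -d -s 8972 172.16.150.222
```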

0 Kudos
beyondvm
Hot Shot

Can you give a screenshot of your network configuration on this host?

Are you trying to use jumbo frames with this interface?

---

If you found any of my comments helpful please consider awarding points for "Correct" or "Helpful". Thanks!!!

www.beyondvm.com

0 Kudos
joergriether
Hot Shot

No jumbo frames.

It's a test system with four 1 Gb Broadcom NICs and one 10 Gb Intel fiber NIC. The hardware is a Dell R710.

There is only one vSwitch with everything on it. All five NICs are plugged into this vSwitch. No further special configuration is applied.

It's a standard test-system scenario, apart from the fact that iSCSI has no dedicated vSwitch, but I know from past experience that this does not cause SUCH an issue.
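In lieu of a screenshot, the same single-vSwitch layout can be dumped from the service console; a sketch of the standard listing commands (the output will of course differ per host):

```shell
# Show all vSwitches, their uplinks (the 4x 1Gb Broadcom and the
# 1x 10Gb Intel NICs would appear here) and their port groups
esxcfg-vswitch -l

# Show the physical NICs with driver, link state and speed
esxcfg-nics -l

# Show the vmkernel interfaces (iSCSI traffic runs over one of these)
esxcfg-vmknic -l
```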

best,

Joerg

0 Kudos
pauliew1978
Enthusiast

I have got this error message before and just logged a support call with VMware about it.

It happens when storage is taken offline and the LUNs are not rescanned in VMware. It doesn't happen in ESX 3 or 3.5, but in 4, basically, it scans the LUNs every half hour, and if you have a LUN that doesn't exist anymore, the ESX server becomes unresponsive.

The only way to clear it is to rescan the LUN or reboot the ESX server.
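For reference, the rescan can also be done from the service console rather than the vSphere client; a minimal sketch, assuming the software iSCSI adapter is vmhba33 as in the logs above:

```shell
# Rescan one adapter for added/removed LUNs
esxcfg-rescan vmhba33

# Re-scan for VMFS volumes afterwards so datastores are refreshed
vmkfstools -V
```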

0 Kudos
joergriether
Hot Shot

I have reproduced the issue. It works like this: if you take away a LUN in an ESX 4 software-iSCSI-initiator environment, no matter whether there was anything on it, and no matter whether all the other LUNs are still connected, the ESX will crash. That is a bad one, and YES, I never experienced this behaviour with ESX 3 or 3.5 either.

VMware responded to my support call that my network config was "not optimal", but they couldn't find the real reason. I think I just found it.

VMware has to do something.

best,

Joerg

0 Kudos
M_B_-_NS
Contributor

Hi,

Any updates on this very critical issue? It really needs to be fixed ASAP.

Regards,

Mathieu

0 Kudos
pauliew1978
Enthusiast

I got fobbed off by support about it. No doubt they will release a fix at some juncture. The support person I dealt with was not very useful; in fact I felt rather put out, as one of his messages said "I haven't got time to deal with this"!!!! I had some outstanding snapshots, which are now committed, and he asked me to apply one ESX patch regarding write-back cache problems if the battery runs out. I have not done this part yet, but I KNOW this is not the issue. I am pretty peeved, as I would consider this pretty awful support. I have long been a preacher of VMware, but this support is truly awful.

0 Kudos
admin
Immortal

Hi Joerg,

ESX doesn't support all paths down (no access) to any LUN. This means that if you have a LUN presented to the ESX server from an iSCSI (SW or HW) or FC SAN, then in case of any failure of this LUN the VMs may crash.

A LUN failure can be 1) a hardware failure on the storage array, 2) the LUN being taken offline, or 3) failure of all paths to the LUN from the ESX server. But please note that the ESX server will not crash in such a scenario: you will still be able to connect to the console; only the VMs might fail, and you will lose access to the datastores created on the SAN LUN.
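Whether all paths to a device really are down can be checked from the service console; a sketch, assuming the device name from Joerg's logs:

```shell
# Brief listing of all paths and their state (e.g. active/dead)
esxcfg-mpath -b

# Restrict the listing to one specific device
esxcfg-mpath -b -d naa.6090a04830f2324f51a97405000010e2
```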

Regards

Sudhish

0 Kudos
admin
Immortal

Hi Joerg,

If the above comment was informative and answered your question please mark this thread as answered.

Regards

Sudhish

0 Kudos
M_B_-_NS
Contributor

As I have reproduced the issue, I can answer: losing VMs on a non-reachable LUN is NOT the problem.

We are aware (who isn't?) that losing all paths to a LUN crashes the VMs.

The issue is that the ESX crashes itself.

Of course it is not supposed to do this, that's why it is a critical issue.

0 Kudos
admin
Immortal

Can you please make it clear what you mean by "ESX crashed"? If it's slow response or failure of VM operations (for VMs running on other LUNs), then that is expected in ESX 4.

If you see the ESX server hang or panic due to all paths down to any LUN, please upload the logs so we can analyze them.

0 Kudos
admin
Immortal

The log snippet uploaded in Comment #1 by Joerg reports an iSCSI connectivity failure to the LUN.

Regards

Sudhish

0 Kudos
M_B_-_NS
Contributor

The issue is exactly the same: I use an EqualLogic SAN and got the same log messages.

It was purposely reproduced, so there were only 2 VMs on the ESX and no VMs on the LUN being discarded. After I made the LUN unavailable at the SAN level, the ESX became totally unresponsive after a while.

After rebooting, I confirmed in the vmkwarning log that I had the same behaviour described in the first post.

Why is it expected that VMs on other LUNs should be impacted?

0 Kudos
admin
Immortal

Please upload the vmkernel logs collected immediately after the server reboot.

Regards

Sudhish

0 Kudos
KingFridayXII
Contributor

I have the exact same issue on my new deployment of 3 ESX servers (Dell R610s) with two Intel dual-port 10 Gb NICs connected to an EqualLogic SAN. Any serious I/O on the servers and they become totally unresponsive, with similar log messages as above. All servers are fully patched 4.0 U1. I have a case open with VMware, but no resolution yet. Things I have tried, based on A LOT of reading on the issue:

- Put each iSCSI vmnic on a separate vSwitch, with only 2 vmk ports on each switch. Also tried 1:1.

- Switched from round-robin MPIO to fixed path (slight improvement: a failure only brings one NIC/vSwitch down, so I have time to migrate guests off and reboot).

- Updated the ixgbe driver to the latest 2.0.44.

- Made netPktHeapMinSize and netPktHeapMaxSize larger.

- Enabled NetQueue and VMDQ support for the adapters.

- Removed the group from dynamic discovery.

No dice....
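For anyone wanting to try the same heap-size change, those advanced options can be read and set from the service console; a sketch, with example values only (sizes are in MB, and a reboot may be needed for heap changes to take effect):

```shell
# Read the current network packet heap sizes
esxcfg-advcfg -g /Net/NetPktHeapMinSize
esxcfg-advcfg -g /Net/NetPktHeapMaxSize

# Raise them (example values; size these for your own workload)
esxcfg-advcfg -s 32  /Net/NetPktHeapMinSize
esxcfg-advcfg -s 128 /Net/NetPktHeapMaxSize
```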

0 Kudos
KingFridayXII
Contributor

M.B - NS, Joerg,

Do you guys have your SR #s with VMware? I would like to pass them along to my tech support engineer. Hopefully we can get some collaboration within VMware tech support to speed up the resolution of this very critical issue. My SR # is 1492590841.

0 Kudos
joergriether
Hot Shot

Hi there,

I HAD an SR with VMware. They told me my network config "was not optimal". They said it is not good to mix 10 GbE with other GbE ports. Well, I have done that for YEARS without any issue. In other words, I guess VMware's answer means: we have absolutely no idea why this is happening to you, and we have never seen it before.

And to add something: I can even reproduce it when ONLY the 10 GbE port is connected.

It seems really clear to me now that other users are experiencing this issue as well.

regards

Joerg

0 Kudos