VMware

17 Replies. Last post: Mar 8, 2010 10:35 AM by joergriether

esx 4 on nehalem vkernel suddenly went down, all iSCSI gone, unresponsive in vsphere center posted: Aug 28, 2009 3:15 AM

joergriether (Hot Shot, 198 posts since Sep 17, 2006)
Dear Group,

Yesterday we encountered something really bad. One of our Nehalem ESX 4 machines (latest patches) with a software iSCSI initiator (the target is EqualLogic) suddenly went totally offline on its vmkernel IP address. In addition, the machine became unresponsive in vSphere center, but the main IP address was still pingable.

The vmkernel log shows some interesting information; take a look:
At 13:45 all was OK, but then suddenly at 14:42, when the handler "world 6303/2" was started (what is this handler?), the catastrophe began.

I had to hard reset the ESX machine to get it online again.

Any ideas?

best,
Joerg

Aug 27 13:45:25 esx7 vmkernel: 6:22:51:50.681 cpu0:4111)FSS: 3647: No FS driver claimed device '4a16ef91-b7c8669d-ac80-002219ccd2a1': Not supported
Aug 27 14:42:50 esx7 vmkernel: 6:23:49:15.841 cpu5:6303)ScsiCore: 95: Starting taskmgmt handler world 6303/2
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:0 T:3 CN:0: iSCSI connection is being marked "OFFLINE"
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess --ISID: 00023d000001 TARGET: iqn.2001-05.com.equallogic:0-8a0906-35e2f2304-bf3000000524a951-eql4-esx-lowpriovol3 TPGT: 1 TSIH: 0--
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn CID: 0 L: 172.16.150.131:56447 R: 172.16.150.222:3260
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:0 T:1 CN:0: iSCSI connection is being marked "OFFLINE"
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess --ISID: 00023d000001 TARGET: iqn.2001-05.com.equallogic:0-8a0906-4f32f2304-e21000000574a951-eql4-esx-lowpriovol1 TPGT: 1 TSIH: 0--
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn CID: 0 L: 172.16.150.131:58041 R: 172.16.150.223:3260
Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:1 L:0 : Task mgmt "Abort Task" with itt=0x133c23 (refITT=0x133c20) timed out.
Aug 27 14:42:54 esx7 vmkernel: 6:23:49:19.778 cpu3:5991)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.6090a04830f2324f51a97405000010e2" due to Not found
Aug 27 14:42:54 esx7 vmkernel: 6:23:49:19.778 cpu3:5991)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.6090a04830f2324f51a97405000010e2": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Aug 27 14:42:54 esx7 vmkernel: 6:23:49:19.778 cpu3:5991)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.6090a04830f2324f51a97405000010e2" is blocked. Not starting I/O from device.
Aug 27 14:42:55 esx7 vmkernel: 6:23:49:20.779 cpu7:4207)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6090a04830f2324f51a97405000010e2" - issuing command 0x410007148540
Aug 27 14:42:55 esx7 vmkernel: 6:23:49:20.779 cpu7:4207)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device
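
For anyone comparing their own logs against the excerpt above, here is a minimal Python sketch that pulls out the two events that matter in this thread: iSCSI connections being marked "OFFLINE" and NMP devices reported as blocked. The regular expressions are an assumption derived only from the lines quoted in this post, not from any documented vmkernel log format, and the names `parse_line` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the excerpt in this post:
# "<Mon> <day> <HH:MM:SS> <host> vmkernel: <uptime> cpuN:WORLD)<message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vmkernel:\s+"
    r"(?P<uptime>\S+)\s+cpu(?P<cpu>\d+):(?P<world>\d+)\)(?P<msg>.*)$"
)
OFFLINE_RE = re.compile(
    r'(vmhba\d+:CH:\d+ T:\d+ CN:\d+): iSCSI connection is being marked "OFFLINE"'
)
BLOCKED_RE = re.compile(r'NMP Device "([^"]+)" is blocked')


def parse_line(line):
    """Split one vmkernel syslog line into fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Return (offline, blocked): offline is a list of (timestamp, connection),
    blocked is a Counter of device names reported as blocked."""
    offline = []
    blocked = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip anything that is not a vmkernel line
        hit = OFFLINE_RE.search(rec["msg"])
        if hit:
            offline.append((rec["stamp"], hit.group(1)))
        hit = BLOCKED_RE.search(rec["msg"])
        if hit:
            blocked[hit.group(1)] += 1
    return offline, blocked


# Four lines copied from the excerpt above.
SAMPLE = [
    'Aug 27 14:42:50 esx7 vmkernel: 6:23:49:15.841 cpu5:6303)ScsiCore: 95: Starting taskmgmt handler world 6303/2',
    'Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:0 T:3 CN:0: iSCSI connection is being marked "OFFLINE"',
    'Aug 27 14:42:53 esx7 vmkernel: 6:23:49:19.105 cpu2:4239)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:0 T:1 CN:0: iSCSI connection is being marked "OFFLINE"',
    'Aug 27 14:42:54 esx7 vmkernel: 6:23:49:19.778 cpu3:5991)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.6090a04830f2324f51a97405000010e2" is blocked. Not starting I/O from device.',
]

if __name__ == "__main__":
    offline, blocked = summarize(SAMPLE)
    for stamp, conn in offline:
        print(stamp, "OFFLINE", conn)
    for device, count in blocked.items():
        print("blocked", device, count)
```

To run it against a real log, pass an open file handle to `summarize` in place of `SAMPLE`; the path of the log file on a given host is not covered here.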

chrisy (Enthusiast, 60 posts since Apr 27, 2005)
I've seen something similar: a host that, when moved to a particular model of switch (for iSCSI), suffered kernel problems and became unmanageable. After a power cycle it was OK as long as iSCSI was not activated; once it was, the whole server fell off the management network as well as iSCSI, and the service console became very sluggish and unresponsive. This was vSphere, fully patched.

Interestingly, increasing the frame size on the switch from 9000 to a slightly larger size seemed to fix it. I suspect that it fixed the symptoms but not the root cause, whatever that is!
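
One possible reading of the frame size observation, not confirmed anywhere in this thread: a host MTU of 9000 counts only the payload, while some switches count the whole Ethernet frame against their configured limit. The Python sketch below does that arithmetic with the standard Ethernet overheads (14-byte header, optional 4-byte 802.1Q tag, 4-byte FCS). The function names and the 9216 value are illustrative assumptions, not settings taken from the switch discussed here.

```python
ETH_HEADER = 14  # destination MAC (6) + source MAC (6) + EtherType (2)
VLAN_TAG = 4     # optional 802.1Q tag
FCS = 4          # frame check sequence (CRC)


def frame_size(mtu, vlan_tagged=False):
    """Size in bytes of a full Ethernet frame carrying an MTU-sized payload."""
    return mtu + ETH_HEADER + (VLAN_TAG if vlan_tagged else 0) + FCS


def fits(mtu, switch_max_frame, vlan_tagged=False):
    """True if a full-size frame for this MTU is within the switch limit,
    assuming the switch limit counts the header and FCS (an assumption)."""
    return frame_size(mtu, vlan_tagged) <= switch_max_frame


if __name__ == "__main__":
    print(frame_size(9000))                    # untagged frame for a 9000-byte MTU
    print(frame_size(9000, vlan_tagged=True))  # same with an 802.1Q tag
    print(fits(9000, 9000))                    # switch limit set to exactly 9000
    print(fits(9000, 9216))                    # a larger, illustrative limit
```

If the switch does count headers, a limit of exactly 9000 would drop full-size frames from a host using a 9000-byte MTU, which would be consistent with a slightly larger setting making the problem go away.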

beyondvm (Hot Shot, 132 posts since Jul 24, 2009)

Can you give a screenshot of your network configuration on this host?

Are you trying to use jumbo frames with this interface?


pauliew1978 (Hot Shot, 227 posts since Dec 5, 2006)

I have had this error message before and just logged a support call with VMware about it.

It happens because storage is taken offline and the LUNs are not rescanned in VMware. It doesn't happen in ESX 3 or 3.5, but ESX 4 basically scans the LUNs every half an hour, and if you have a LUN that doesn't exist anymore, the ESX server becomes unresponsive.

The only way to clear it is to rescan the LUN or reboot the ESX server.

M.B - NS (Novice, 10 posts since Jun 3, 2008)

Hi,

Any updates on this very critical issue? It really needs to be fixed ASAP.

Regards,

Mathieu

pauliew1978 (Hot Shot, 227 posts since Dec 5, 2006)

I got fobbed off by support about it. No doubt they will release a fix at some juncture. The support person I dealt with was not very useful; in fact, I felt rather put out, as one of his messages said "I haven't got time to deal with this"! I had some outstanding snapshots, which are now committed, and he asked me to add one patch to ESX regarding write-back cache problems if the battery runs out. I have not done this part yet, but I KNOW this is not the issue. I am pretty peeved, as I would consider this pretty awful support. I have long been a preacher of VMware, but this support is truly awful.

sudhishpt (Enthusiast, VMware Employees, 44 posts since Jan 6, 2008)
Hi Joerg,

ESX doesn't support an all-paths-down (no access) condition for any LUN. This means that if a LUN is presented to the ESX server from iSCSI (SW or HW) or an FC SAN and access to that LUN fails, VMs may crash.

A LUN failure can be 1) a hardware failure on the storage array, 2) the LUN being taken offline, or 3) a failure of all paths to the LUN from the ESX server. But please note that the ESX server will not crash in such a scenario and you will still be able to connect to the console; only the VMs might fail, and you will lose access to the datastores created on the SAN LUN.

Regards
Sudhish

sudhishpt (Enthusiast, VMware Employees, 44 posts since Jan 6, 2008)
Hi Joerg,

If the above comment was informative and answered your question please mark this thread as answered.

Regards
Sudhish

M.B - NS (Novice, 10 posts since Jun 3, 2008)

As I reproduced the issue, I can answer: losing VMs on an unreachable LUN is NOT the problem.

We are aware (who isn't?) that losing all paths to a LUN crashes the VMs on it.

The issue is that ESX itself crashes.

Of course it is not supposed to do this; that's why it is a critical issue.

sudhishpt (Enthusiast, VMware Employees, 44 posts since Jan 6, 2008)
Can you please clarify what you mean by "ESX crashed"? If it is a slow response or a failure of VM operations (for VMs running on other LUNs), then that is expected in ESX 4.

If you see the ESX server hang or panic because of an all-paths-down condition on any LUN, please upload the logs for analysis.

sudhishpt (Enthusiast, VMware Employees, 44 posts since Jan 6, 2008)
The log snippet Joerg uploaded in comment #1 reports an iSCSI connectivity failure to the LUN.

Regards
Sudhish

M.B - NS (Novice, 10 posts since Jun 3, 2008)
The issue is exactly the same: I use an EqualLogic SAN and got the same log messages.
I reproduced it on purpose, so there were only 2 VMs on the ESX host and no VMs on the LUN being discarded. After I made the LUN unavailable at the SAN level, the ESX host became totally unresponsive after a while.

After the reboot, I confirmed in the vmkwarning log that I had the same behavior described in the first post.

Why is it expected that VMs on other LUNs should be impacted?

sudhishpt (Enthusiast, VMware Employees, 44 posts since Jan 6, 2008)
Please upload the vmkernel logs collected immediately after the server reboot.

Regards
Sudhish
