VMware Cloud Community
JoergXB
Contributor

x3650 M3 with ESXi 5 loses connection to LUNs

Hi to all,

I hope someone can help me here.

It seems that I'm affected by a bug described in this VMware KB article. I followed the workaround, but in my case it doesn't help. VMware Support only told me that they would file my SR under the PR.

I don't know if I understand the problem correctly, but could disabling the VT-d option in the BIOS be a second option? Support couldn't answer that question.

UEFI, HBA and driver are all at the latest builds.

vmkernel.log:

2011-12-30T08:13:45.820Z cpu8:5029)NMP: nmpStateLogger:3109: NMP module ID: c.

2011-12-30T08:13:45.820Z cpu8:5029)NMP: nmpStateLogger:3110: NMP heap ID: 0x410011dc0000.

2011-12-30T08:13:45.820Z cpu8:5029)NMP: nmpStateLogger:3111: NMP slab ID: 0x410005678100.

2011-12-30T08:13:45.820Z cpu8:5029)NMP: nmpStateLogger:3123: NMP is managing 6 devices:

2011-12-30T08:13:51.367Z cpu8:5029)VMKAcpi: 4110: \_SB.OSC evaluation: Result = 0x1, Flags =0x10

2011-12-30T08:13:51.370Z cpu8:5029)TPM FixedMem: start = 0xfed40000, end = 0xfed44fff, write protect = 0

2011-12-30T08:13:58.090Z cpu5:5030)NMP: nmpStateLogger:3109: NMP module ID: c.

2011-12-30T08:13:58.090Z cpu5:5030)NMP: nmpStateLogger:3110: NMP heap ID: 0x410011dc0000.

2011-12-30T08:13:58.090Z cpu5:5030)NMP: nmpStateLogger:3111: NMP slab ID: 0x410005678100.

2011-12-30T08:13:58.090Z cpu5:5030)NMP: nmpStateLogger:3123: NMP is managing 6 devices:

2011-12-30T08:14:03.623Z cpu3:5030)VMKAcpi: 4110: \_SB.OSC evaluation: Result = 0x1, Flags = 0x10

2011-12-30T08:14:03.626Z cpu3:5030)TPM FixedMem: start = 0xfed40000, end = 0xfed44fff, write protect = 0

2011-12-30T08:14:10.451Z cpu8:5033)NMP: nmpStateLogger:3109: NMP module ID: c.

2011-12-30T08:14:10.451Z cpu8:5033)NMP: nmpStateLogger:3110: NMP heap ID: 0x410011dc0000.

2011-12-30T08:14:10.451Z cpu8:5033)NMP: nmpStateLogger:3111: NMP slab ID: 0x410005678100.

2011-12-30T08:14:10.451Z cpu8:5033)NMP: nmpStateLogger:3123: NMP is managing 6 devices:

2011-12-30T08:14:14.275Z cpu0:5033)WARNING: APIC: 1839: APICID 0x00000000 - ESR = 0x40

2011-12-30T08:14:15.980Z cpu9:5033)VMKAcpi: 4110: \_SB.OSC evaluation: Result = 0x1, Flags = 0x10

2011-12-30T08:14:15.983Z cpu9:5033)TPM FixedMem: start = 0xfed40000, end = 0xfed44fff, write protect = 0

2011-12-30T08:14:26.897Z cpu5:2193)<6>Fusion MPT SAS Host:10:0:0:2 :: attempting task abort! scmd(0x4124014a4200)

2011-12-30T08:14:26.897Z cpu5:2193)Fusion MPT SAS Host:10:0:0:2 ::  <6>        command: Write(10): 2a 00 00 00 ac e0 00 00 01 00

2011-12-30T08:14:26.897Z cpu5:2193)<6>Channel :0 Id :0 :: handle(0x0009), sas_address(0x500a0b83a9907008), phy(0)

2011-12-30T08:14:26.897Z cpu5:2193)<6>Channel :0 Id :0 :: enclosure_logical_id(0x500605b004547150), slot(0)

2011-12-30T08:14:27.422Z cpu2:2050)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba3:0:0:2 (driver name: Fusion MPT SAS Host) - Message repeated 1 time

2011-12-30T08:14:27.422Z cpu8:2056)ScsiDeviceIO: 2305: Cmd(0x412400716b40) 0x2a, CmdSN 0x688 to dev "naa.600a0b80003a99070000026c4ee6e933" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

2011-12-30T08:14:29.794Z cpu5:5084)ScsiCore: 63: Starting taskmgmt handler world 5084/2

2011-12-30T08:14:29.794Z cpu10:5084)<6>Fusion MPT SAS Host:10:0:0:0 :: attempting task abort! scmd(0x4124014b2600)

2011-12-30T08:14:29.794Z cpu10:5084)Fusion MPT SAS Host:10:0:0:0 ::  <6>        command: Write(10): 2a 00 00 00 ac e0 00 00 01 00

2011-12-30T08:14:29.794Z cpu10:5084)<6>Channel :0 Id :0 :: handle(0x0009), sas_address(0x500a0b83a9907008), phy(0)

2011-12-30T08:14:29.794Z cpu10:5084)<6>Channel :0 Id :0 :: enclosure_logical_id(0x500605b004547150), slot(0)

2011-12-30T08:14:39.803Z cpu7:5085)ScsiCore: 63: Starting taskmgmt handler world 5085/3

2011-12-30T08:14:39.803Z cpu7:5085)<6>Fusion MPT SAS Host:10:0:0:0 :: attempting task abort! scmd(0x41240141aa80)

2011-12-30T08:14:39.803Z cpu7:5085)Fusion MPT SAS Host:10:0:0:0 ::  <6>        command: Write(10): 2a 00 00 03 6a 85 00 00 01 00

2011-12-30T08:14:39.803Z cpu7:5085)<6>Channel :0 Id :0 :: handle(0x0009), sas_address(0x500a0b83a9907008), phy(0)

2011-12-30T08:14:39.803Z cpu7:5085)<6>Channel :0 Id :0 :: enclosure_logical_id(0x500605b004547150), slot(0)

2011-12-30T08:14:42.455Z cpu6:2054)ScsiDeviceIO: 2316: Cmd(0x412400716b40) 0x2a, CmdSN 0x688 to dev "naa.600a0b80003a99070000026c4ee6e933" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2011-12-30T08:14:42.499Z cpu6:2054)ScsiDeviceIO: 2316: Cmd(0x4124006f1300) 0x2a, CmdSN 0x689 to dev "naa.600a0b80003a99070000026c4ee6e933" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2011-12-30T08:14:45.808Z cpu2:5086)ScsiCore: 63: Starting taskmgmt handler world 5086/4

2011-12-30T08:14:56.900Z cpu5:2193)<3>mpt2sas0: mpt2sas_scsih_issue_tm: timeout <6>mf:      01000009 00000100 00000000 00000200 00000000 00000000 00000000 00000000       00000000 00000000 00000000 00000000 00000152

2011-12-30T08:15:00.676Z cpu0:5086)<6>Fusion MPT SAS Host:10:0:0:1 :: attempting task abort! scmd(0x412401353d40)

2011-12-30T08:15:00.676Z cpu0:5086)Fusion MPT SAS Host:10:0:0:1 ::  <6>        command: Log Sense: 4d 00 6f 00 00 00 00 00 ff 00

2011-12-30T08:15:00.676Z cpu0:5086)<6>Channel  Id  :: handle(0x0009), sas_address(0x500a0b83a9907008), phy(0)
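
For anyone comparing their own logs against the excerpt above, the H:/D:/P: triple on the ScsiDeviceIO lines is the quickest thing to look at. Below is a minimal Python sketch that pulls that triple out of a log line and names the common values. The function name `decode_status` and both lookup tables are illustrative assumptions, not something from this thread: the device (D:) names are the standard SCSI status codes, and the host (H:) names follow the way ESXi host status values are commonly listed.

```python
import re

# Host (H:) status names as commonly listed for ESXi.
# Assumption for illustration; not taken from this thread.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

# Device (D:) status values are the standard SCSI status codes.
DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}

STATUS_RE = re.compile(r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)")
DEVICE_RE = re.compile(r'dev "(naa\.[0-9a-fA-F]+)"')


def decode_status(line):
    """Return the decoded H:/D:/P: triple of one log line, or None if absent."""
    status = STATUS_RE.search(line)
    if status is None:
        return None
    host, device, plugin = (int(value, 16) for value in status.groups())
    dev = DEVICE_RE.search(line)
    return {
        "dev": dev.group(1) if dev else None,
        # Unknown codes fall back to their hex value instead of a guess.
        "host": HOST_STATUS.get(host, hex(host)),
        "device": DEVICE_STATUS.get(device, hex(device)),
        "plugin": hex(plugin),
    }
```

Fed the ScsiDeviceIO lines above, this reports a device status of BUSY for the first failure (H:0x0 D:0x8) and a host status of TIMEOUT for the later ones (H:0x3 D:0x0).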

Thanks for your help,

Joerg

I am a VM on the hypervisor called earth! A+ Certs, some MS Certs, VCP4, VCP5, hope VCAP 5 coming soon
PABB
Contributor

... one of my clients has exactly the same problem. The cluster was built with two x3650 M3 servers connected to a DS3524 storage array (SAS interface).

Solution from KB1030265 applied.

ESXi patched with ESXi500-201111001, ESXi500-201112001, mpt2sas-10.10.10.00.1vmw-offline_bundle-536503.

As you can see, the vmkernel logs are very similar to yours:

<6>bnx2: vmnic4 NIC Copper Link is Up, 1000 Mbps full duplex
<6>bnx2: vmnic2 NIC Copper Link is Up, 1000 Mbps full duplex
2011-12-30T13:46:29.608Z cpu5:2099)NetPort: 1426: disabled port 0x1000003
2011-12-30T13:46:29.608Z cpu5:2099)Uplink: 5244: enabled port 0x1000003 with mac 5c:f3:fc:6a:6a:f0
2011-12-30T13:46:29.620Z cpu5:2099)NetPort: 1426: disabled port 0x2000002
2011-12-30T13:46:29.620Z cpu5:2099)Uplink: 5244: enabled port 0x2000002 with mac 00:10:18:be:b6:ec
2011-12-30T15:18:09.408Z cpu20:2068)VMW_SATP_LSI: satp_lsi_pathFailure:1119: Command 0x2a to naa.60080e500024f13400000d414ef4e1a4 (fcf 0) failed with NOT_READY (0x2/0x4/0x1), on path vmhba4:C0:T0:L1 (pnr 1, iet 0x16065d3)
2011-12-30T15:18:09.408Z cpu20:2068)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x412440376b80) to dev "naa.60080e500024f13400000d414ef4e1a4" on path "vmhba4:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x4 Possible sense data: 0x2 0x4 0x1.Act:NONE
2011-12-30T15:18:09.408Z cpu20:2068)ScsiDeviceIO: 2305: Cmd(0x412440376b80) 0x2a, CmdSN 0x800000d3 to dev "naa.60080e500024f13400000d414ef4e1a4" failed H:0x0 D:0x2 P:0x4 Possible sense data: 0x2 0x4 0x1.
2011-12-30T15:18:09.414Z cpu3:2051)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x28 (0x412400d544c0) to dev "naa.60080e500024f13400000d414ef4e1a4" on path "vmhba4:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x4 Possible sense data: 0x2 0x4 0x1.Act:NONE
2011-12-30T15:18:10.517Z cpu14:12552)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x412440376b80) to dev "naa.60080e500024f13400000d414ef4e1a4" on path "vmhba4:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x94 0x1.Act:FAILOVER
2011-12-30T15:18:10.517Z cpu14:12552)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.60080e500024f13400000d414ef4e1a4": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2011-12-30T15:18:10.524Z cpu3:2051)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x28 (0x412400d544c0) to dev "naa.60080e500024f13400000d414ef4e1a4" on path "vmhba4:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x94 0x1.Act:FAILOVER
2011-12-30T15:18:10.624Z cpu12:22151)vmw_psp_mru: psp_mruSelectPathToActivateInt:346: Changing active path from vmhba4:C0:T0:L1 to vmhba3:C0:T0:L1 for device "naa.60080e500024f13400000d414ef4e1a4".
2011-12-30T15:18:10.624Z cpu17:2738)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.60080e500024f13400000d414ef4e1a4" - issuing command 0x412440376b80
2011-12-30T15:18:10.636Z cpu19:2067)NMP: nmpCompleteRetryForPath:321: Retry world recovered device "naa.60080e500024f13400000d414ef4e1a4"
2011-12-30T15:18:10.997Z cpu15:2063)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x412440ff9f40) to dev "naa.60080e500024f13400000d414ef4e1a4" on path "vmhba4:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x94 0x1.Act:FAILOVER
2011-12-30T15:18:10.997Z cpu15:2063)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.60080e500024f13400000d414ef4e1a4": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2011-12-30T15:18:11.624Z cpu17:2738)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.60080e500024f13400000d414ef4e1a4" - issuing command 0x412440ff9f40
2011-12-30T15:18:11.625Z cpu14:2062)NMP: nmpCompleteRetryForPath:321: Retry world recovered device "naa.60080e500024f13400000d414ef4e1a4"
2011-12-30T16:09:21.920Z cpu0:2072)WARNING: APIC: 1839: APICID 0x00000000 - ESR = 0x40
2011-12-30T16:09:34.131Z cpu2:2242)<6>Fusion MPT SAS Host:13:0:0:2 :: attempting task abort! scmd(0x412440cf0ec0)
2011-12-30T16:09:34.131Z cpu2:2242)Fusion MPT SAS Host:13:0:0:2 ::  <6>        command: Write(10): 2a 00 00 00 af 70 00 00 01 00
2011-12-30T16:09:34.131Z cpu2:2242)<6>Channel  Id  :: handle(0x0009), sas_address(0x50080e524f080004), phy(0)
2011-12-30T16:09:34.131Z cpu2:2242)<6>Channel : 0 Id : 0 :: enclosure_logical_id(0x500605b004548890), slot(0)
2011-12-30T16:09:34.438Z cpu11:2059)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba4:0:0:0 (driver name: Fusion MPT SAS Host) - Message repeated 1 time
2011-12-30T16:09:34.438Z cpu11:2059)ScsiDeviceIO: 2305: Cmd(0x412400d811c0) 0x2a, CmdSN 0x800e0051 to dev "naa.60080e500024f08000000ae54ef4e109" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2011-12-30T16:09:36.154Z cpu11:24558)ScsiCore: 63: Starting taskmgmt handler world 24558/2
2011-12-30T16:09:36.154Z cpu11:24558)<6>Fusion MPT SAS Host:13:0:0:0 :: attempting task abort! scmd(0x412440cf2cc0)
2011-12-30T16:09:36.154Z cpu11:24558)Fusion MPT SAS Host:13:0:0:0 ::
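
To line up excerpts like this one across hosts, it can help to extract just the sense data and the path failover events. The following is a rough Python sketch under stated assumptions: the names `decode_sense` and `find_failovers` are hypothetical, the sense key table is the standard SCSI one, and the ASC/ASCQ values are left as raw hex because codes such as 0x94 are vendor-specific.

```python
import re

# Standard SCSI sense keys (the first value after "sense data:").
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0xB: "ABORTED COMMAND",
}

SENSE_RE = re.compile(
    r"(Possible|Valid) sense data: "
    r"(0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)
FAILOVER_RE = re.compile(
    r'Changing active path from (\S+) to (\S+) for device "([^"]+)"'
)


def decode_sense(line):
    """Return the sense key name and raw ASC/ASCQ of one log line, or None."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(value, 16) for value in match.groups()[1:])
    return {
        "valid": match.group(1) == "Valid",  # "Possible" means not confirmed
        "key": SENSE_KEYS.get(key, hex(key)),
        "asc": hex(asc),    # left undecoded: may be vendor-specific
        "ascq": hex(ascq),
    }


def find_failovers(lines):
    """Collect (old path, new path, device) for every active-path change."""
    events = []
    for line in lines:
        match = FAILOVER_RE.search(line)
        if match:
            old_path, new_path, device = match.groups()
            events.append({"from": old_path, "to": new_path, "dev": device})
    return events
```

On the lines above, the sense key of the first failures decodes as NOT READY, which matches the NOT_READY (0x2/0x4/0x1) that VMW_SATP_LSI prints itself, and the single path change goes from vmhba4:C0:T0:L1 to vmhba3:C0:T0:L1.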

So far, IBM support sees no problem with the hardware...

Any suggestions? :)

PABB

JoergXB
Contributor

Hello, and thanks for the reply.

I don't think it's really a hardware problem. I tested one of the servers with Windows Server 2008 and that works. I think it is a bug in how the VMware kernel works with the IOMMU.

In the end I decided to disable the chipset feature VT-d in UEFI because I don't need DirectPath I/O in this cluster. Since I disabled it, both servers have been running fine. A heavy workload simulated with Iometer ran for over 72 hours without any issues.

VMware told me that this bug will be fixed in the next major update, due in the second quarter of 2012.

Regards,

Joerg

hww
Contributor

thanks, jörg. disabling vt-d on our 'certified' supermicros actually helped us with this issue, too. every couple of days or so, the servers lost their connection to our FC EMCs (most of the time, however, when the load on our SAN was low anyway). we were nearly desperate about how to solve this issue! hopefully VMware will fix that thingie, as you said, in the next release / update.

JoergXB
Contributor

You're welcome. I found out that VT-d is disabled by default on IBM servers, but the hardware guy who did the initial setup had enabled it. If you don't need DirectPath I/O, you can leave it disabled.

Regards,

Joerg

hww
Contributor

as a follow-up on this: for nearly two weeks now, since disabling vt-d on our Supermicro cluster, things have been running oh-so-fine here. we support your assumption that this is a bug within the provided kernel itself (actually, some random native-BSD users are suffering from that issue themselves).

feel free to point VMware to this thread, since you seem to have an open trouble ticket with them about this already. even though we don't need vt-d yet, we're going to need it soon, and we'd be willing to provide VMware with any information that could help them hunt this bug down.

it doesn't seem to be machine-specific - you're using IBMs, we're using Supermicros.

once again, though: thanks for your hint!

post scriptum: we applied the same KB article as you did at the beginning of this thread, and it didn't help us either!

best regards,

hww

artjackson
Contributor

We had about the same issue, but I don't have the logs available to compare. Would disabling VT-d be the same as running this command over SSH?

esxcli system settings kernel set --setting=iovDisableIR -v TRUE

That is what VMware tech support had us do to stop our fiber connections from dropping after we upgraded to ESXi 5 on IBM x3650 M2 / M3 servers with QLogic cards. They said it is an issue IBM is working on fixing.

JoergXB
Contributor

For us, disabling IOMMU via ESXCLI alone didn't help. We had to disable the VT-d feature in the BIOS.

Regards,

Joerg

JoergXB
Contributor

It seems that IBM has fixed this problem with a UEFI update. Have a look here: Link!

Regards,

Joerg
