We have a five-node vSphere 5.1 cluster. The first four hosts have been working fine for years; the fifth host was recently added and is having performance problems. These problems manifest as extremely slow VM performance, especially when attempting to log on to or reboot a (Windows) guest. As an example, one Windows server we have will reboot in 4 minutes on any of our other hosts, yet takes 25 minutes to reboot on host 5.
I opened a case with VMware support, and they identified the cause as the host disconnecting from its storage. I saw the errors in the logs myself (All Paths Down), so I don't question their diagnosis, but they were not able to tell me anything beyond that and told me to contact the storage vendor.
All five hosts are connected via FC through a single Brocade switch to a single FalconStor SAN. There's no multipathing involved. All hardware is on each vendor's HCL. VMware confirmed the HBA drivers are correct and did not find any issues with the server configuration. All five hosts are on the same patch level, 5.1.0 build 1743533. Basically, everything I can think to look at on the VMware side seems fine, and there are no errors on the Brocade switch.
Anyway, I contacted FalconStor; they checked their side of things and are telling me everything looks good. Specifically, I'm being told the following (quoted verbatim): "If the client disconnects a path with us on the upstream, [the SAN] will not log the activity. We will log LUN resets and command aborts coming from the client that is affecting our targets but we do not see any of that."
I'm at a total loss here and not sure where to go next. I have Production Support with VMware and the equivalent with FalconStor, yet I'm in the classic "it's the other vendor" situation and am stuck.
Any advice as to my next step, how to troubleshoot further, etc., would be greatly appreciated.
Well, the first thing I would suggest is a three-way call with your storage vendor and the VMware rep if the community is unable to resolve the problem with you. Most vendors will usually agree to this after some finger-pointing has taken place. With that said, let's see what we can find.
Post a copy of your VMkernel.log
Another thing that could be causing an issue is the zoning on your FC switch. How do you have the zoning set up? Is it software (WWN-based) or hardware (port-based)? If it is hardware, are you splitting it out at all, or is every port just * * (essentially everything open)? You could try zoning the ports from host 5 away from the rest of the ports if the others are in a * * configuration, then see if the same problem occurs.
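For reference, isolating host 5 into its own zone on a Brocade switch looks roughly like this from the FOS CLI. This is only a sketch: the alias names, zone/config names, and WWPNs below are placeholders you'd replace with your fabric's actual values.

```shell
# Placeholder names and WWPNs -- substitute your own before running.
alicreate "host5_hba1", "10:00:00:00:00:00:00:01"
alicreate "falconstor_tgt", "20:00:00:00:00:00:00:02"
# Zone containing only host 5's HBA and the storage target:
zonecreate "z_host5_storage", "host5_hba1; falconstor_tgt"
# Add the zone to the active config and activate it:
cfgadd "prod_cfg", "z_host5_storage"
cfgenable "prod_cfg"
```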
Also, on host 5 and host 4, run the following command at the CLI:
esxcfg-mpath --list-paths --device <device mpx/naa name> | grep state
If you don't know the problem device (HBA) you can just run this:
esxcfg-mpath -b | grep -C 1 dead
This will show whether there are any active all-paths-down conditions. You will want to make sure there is no All-Paths-Down state before starting troubleshooting so you can try to pinpoint the issue.
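To make the check concrete, here's a sketch of that pipeline run against sample output (the path lines below are illustrative; on the host you'd pipe straight from `esxcfg-mpath -b`):

```shell
# Simulated esxcfg-mpath style output; on the host, replace the
# printf with the real command. Counts paths reported as dead.
printf '%s\n' \
  'vmhba2:C0:T0:L1 state:active' \
  'vmhba3:C0:T0:L1 state:dead' \
  'vmhba3:C0:T0:L2 state:dead' |
grep -c dead
# -> 2
```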
Also, you can check esxtop and look at the DAVG for your HBA. Your DAVG should be under 100; the odd spike is okay, but if it's living higher than 100 constantly, you have a communication problem with your storage. DAVG is the overall latency, i.e. the time it takes to send a command to your SAN and get a response.
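You can also capture DAVG non-interactively with esxtop batch mode (`esxtop -b -n 5 > stats.csv`) and filter the latency columns afterwards. A rough awk sketch over simplified two-field sample data (real esxtop CSVs have many columns; the adapter/value pairs here are made up for illustration):

```shell
# Flag DAVG samples above the ~100 ms guideline.
# The sample data stands in for values pulled out of esxtop batch output.
printf '%s\n' 'vmhba2,1.8' 'vmhba2,73.4' 'vmhba2,112.9' |
awk -F, '$2 > 100 { print $1, "DAVG high:", $2 }'
# -> vmhba2 DAVG high: 112.9
```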
Hope this helps,
Thanks JPM300, I appreciate the reply. I've attached the vmkernel.log.
I'll admit to not being an FC expert, but as far as I know the zoning is in hardware; I'm configuring the zoning on the switch by creating and applying a zone config. As far as how it's set up, the five VMware hosts can see each other and the storage array. Everything else is isolated.
Right now, no paths are down, and I should elaborate on that a bit. I've been having the performance issues for a couple of weeks, since the server was set up. Twice now, in anticipation of troubleshooting, I've attempted to put the problem server in maintenance mode, migrating active VMs in the process. After migrating 4-5 VMs to another host, the problem host would completely hang, requiring a cold boot of the ESXi server. I was told by support that the loss of connectivity to storage (i.e. all paths down) is what was causing the host to hang, and that the "fix" for that is a reboot of the host.
After rebooting, the paths do reconnect and I have not seen them disconnect so long as I don't try to vMotion a ton of stuff at once.
I've been watching the DAVG all morning and it's been consistently above 70 on the problem host. The same reading on my other hosts is typically < 2. So I think it's safe to say I have a latency issue between the problem host and my array. Now it's just a matter of where.
Thanks again, James
Yeah, I would definitely say that if all your other hosts with the exact same LUNs are averaging around <2 DAVG while your new host is constantly around 70 on the same LUNs, there is definitely a latency problem with that host. I'm also willing to bet that once you put more VMs on it, that number would jump high enough that retries just disconnect the LUN. If there are too many retry requests, or a timeout runs too long, the VMware host will think the LUN is disconnected, giving an APD error message.
It could be the zoning, as I've seen this before; however, when I saw it previously it was a MUCH larger environment and everyone was experiencing the issue. Since only one host is seeing the lag, I'm inclined to think the problem is isolated there. I'll take a look through the vmkernel.log and get back to you in a bit.
Are you currently running SIOC (Storage I/O Control) with Storage DRS?
Also, I'm seeing a lot of errors around this world ID / storage device:
CmdSN 0x2d3 from world 9481 to dev "naa.6c81f660ea0af8001b05d51807353714" failed
Is this a new LUN that could have been created at the same time the new host came online?
I'm seeing this path failing A LOT:
Try to look into that path and see if you can trace it out. It could be a bad fibre connection or SFP.
Furthermore, to add on from the vmkernel log, I see a bunch of these:
"naa.6c81f660ea0af8001b05d51807353714" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-06-16T11:31:32.636Z cpu19:8211)ScsiDeviceIO: 2331: Cmd(0x41244425e080) 0x1a, CmdSN 0x2c3 from world 9481 to dev "naa.6c81f660ea0af8001b05d51807353714" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2014-06-16T12:01:32.708Z cpu24:8216)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124456c51c0, 9481) to dev "naa.6c81f660ea0af8001b05d51807353714" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2014-06-16T12:01:32.708Z cpu24:8216)ScsiDeviceIO: 2331: Cmd(0x4124456c51c0) 0x85, CmdSN 0x2c4 from world 9481 to dev "naa.6c81f660ea0af8001b05d51807353714" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-06-16T12:01:32.708Z cpu24:8216)ScsiDeviceIO: 2331: Cmd(0x4124456c51c0) 0x4d, CmdSN 0x2c5 from world 9481 to dev
I'm not sure how new they are or if they keep showing up; it's something you can check for. But with that said, this error:
H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0 — the H:0x0 part points to VMK_SCSI_HOST_OK, meaning there was no error from the host side, while D:0x2 is a check condition coming back from the device. Your sense data is a little different from what I usually see, so I'm not too sure. With that said, it seems to point to the fact that the VMware host is making the communication attempts properly, and it would be either a SATP driver / pathing issue or a connection / zoning issue.
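For reference, the first byte of that sense triplet is the SCSI sense key, which you can decode by hand; a minimal sketch covering a few standard keys (0x5 is Illegal Request, consistent with a device rejecting commands like 0x85 / 0x4d / 0x1a that it doesn't support):

```shell
# Decode the SCSI sense key from "Valid sense data: 0x5 0x20 0x0".
# Only a few standard keys are mapped here; the rest fall through.
decode_sense_key() {
  case "$1" in
    0x0) echo 'NO SENSE' ;;
    0x2) echo 'NOT READY' ;;
    0x3) echo 'MEDIUM ERROR' ;;
    0x5) echo 'ILLEGAL REQUEST' ;;
    *)   echo "other ($1)" ;;
  esac
}
decode_sense_key 0x5
# -> ILLEGAL REQUEST
```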
Hmm, that's the device's local storage. I took a look at one of my other hosts and see the same errors. If I'm reading things correctly, it looks like that's an error that can be safely ignored, per this KB article. vmhba2 is the FC card connected to storage.
Off-the-wall question...while these LUNs are FC connected, I could fairly easily switch them to iSCSI. Is that worth doing? I'm thinking if the latency is normal on iSCSI that would pretty well point to it being something on the FC side. Of course the disadvantage is while iSCSI is setup, we've never used it other than for occasionally mounting a LUN from our backup appliance, so I have no frame of reference for performance.
Thanks again for all your help.
You could; the problem is you would have to set up iSCSI across all the hosts, which would be a lot of work to test one item. If you click on a host and select Configuration > Storage and click on the LUNs, you should be able to hunt down the NAA number coming up with the failures. You should also be able to hunt down this path the same way, or in the storage screen: vmhba1:C2:T0:L0
See what that is pointing to first.
Ah, I wasn't sure but I thought I could present the same LUN to certain hosts via FC and to the problem host with iSCSI. I agree that reconfiguring them all would be too much work.
I think you might have missed my prior post. vmhba1 is the host's local storage and it looks like the error is benign, or at least something that is happening on at least one other functioning host.
Again, thanks for your help. I'm learning a lot in the process.
Hmm, weird. Would you be able to provide the tail of the vmkernel log now?
You can just go into the CLI and type tail /var/log/vmkernel.log and paste it here, or take a screenshot, whichever.
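If the raw tail is noisy, it can help to filter for failures first; a sketch (sample lines stand in for the log here, but on the host you'd pipe from `tail -n 200 /var/log/vmkernel.log`):

```shell
# Keep only lines that look like storage errors.
# On the host: tail -n 200 /var/log/vmkernel.log | grep -iE 'failed|timeout|apd'
printf '%s\n' \
  'NetPort: 1392: enabled port 0x4000042' \
  'ScsiDeviceIO: 2331: Cmd 0x1a failed H:0x0 D:0x2' \
  'SVM: 4212: Created device a59915-svmmirror' |
grep -iE 'failed|timeout|apd'
# -> ScsiDeviceIO: 2331: Cmd 0x1a failed H:0x0 D:0x2
```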
Here you go. I do have a storage vMotion running on that host right now (from a FC LUN to an iSCSI LUN) just to see what DAVG looks like on the iSCSI LUN.
2014-06-16T16:33:41.434Z cpu16:251489)NetPort: 1392: enabled port 0x4000042 with mac 00:50:56:9c:19:cf
2014-06-16T16:33:47.361Z cpu16:251717)SVM: 4117: SkipZero 0, dstFsBlockSize -1, preallocateBlocks 1, vmfsOptimizations 1
2014-06-16T16:33:47.361Z cpu16:251717)SVM: 4176: SVM_MakeDev.4176: Creating device a59915-8ad91d-svmmirror: Success
2014-06-16T16:33:47.361Z cpu16:251717)SVM: 4212: Created device a59915-8ad91d-svmmirror, primary a59915, secondary 8ad91d
2014-06-16T16:33:47.361Z cpu16:251717)VSCSI: 3748: handle 8230(vscsi0:0):Using sync mode due to sparse disks
2014-06-16T16:33:47.361Z cpu16:251717)VSCSI: 3789: handle 8230(vscsi0:0):Creating Virtual Device for world 251489 (FSS handle 11966776)
2014-06-16T17:01:33.359Z cpu26:8218)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x412447401480, 9481) to dev "naa.6c81f660ea0af8001b05d51807353714" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2014-06-16T17:01:33.359Z cpu26:8218)ScsiDeviceIO: 2331: Cmd(0x412447401480) 0x85, CmdSN 0x2e2 from world 9481 to dev "naa.6c81f660ea0af8001b05d51807353714" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-06-16T17:01:33.359Z cpu26:8218)ScsiDeviceIO: 2331: Cmd(0x412447401480) 0x4d, CmdSN 0x2e3 from world 9481 to dev "naa.6c81f660ea0af8001b05d51807353714" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-06-16T17:01:33.359Z cpu26:8218)ScsiDeviceIO: 2331: Cmd(0x412447401480) 0x1a, CmdSN 0x2e4 from world 9481 to dev "naa.6c81f660ea0af8001b05d51807353714" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Yeah, I don't see any glaring issues in that snippet of the log. To answer your question: with iSCSI the DAVG numbers still need to stay the same, under 100. Usually Fibre Channel has an easier time of doing this, but with 10Gb iSCSI becoming more popular / mainstream it's becoming a toss-up.
Either way, that's the number you want to stay under.
I still think there is either a cable problem, path problem, or SATP issue. Why it would only affect that one host is odd, though. Are the SATP and pathing the same on all hosts?
SATP stands for Storage Array Type Plugin; it's the plugin VMware uses for its multipathing (MPIO). Your PSP is your Path Selection Policy, aka round robin, fixed, or most recently used. They tie into each other closely.
You can get your SATP by typing:
esxcli storage nmp satp list
Check the SATP on the other hosts as well and make sure they're the same. After that, check your pathing policy and make sure it's using the same fixed path. Another thing you can try is changing which path it is taking. You can do this in the GUI by right-clicking on the datastore and going to Properties; from there, select Manage Paths.
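Those same checks, and the path policy change, can be done from the CLI as well; a sketch using the standard esxcli 5.x storage namespaces (the naa ID below is a placeholder, substitute the LUN you're testing):

```shell
# Show the SATP and PSP in effect for each device:
esxcli storage nmp device list
# List the paths for one device (placeholder naa ID):
esxcli storage nmp path list -d naa.xxxxxxxxxxxxxxxx
# Switch that device to the fixed path selection policy:
esxcli storage nmp device set -d naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_FIXED
```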
If you move your fixed path over to the second fibre connection and the problem still occurs, the paths are probably fine, as both are doing it. However, if you pick a different fibre HBA/path and the problem goes away, it could be that one cable and/or path.
Sorry, I read that as STAP, and Google didn't turn up any results.
I don't have multipathing, so I've never really looked into this, but checking just now, the SATP list shown is the same across the hosts. The Storage Array Type shows as "VMW_SATP_ALUA", which is what FalconStor indicates it should be, and the path selection is MRU (VMware). This is consistent across hosts as well.
I've got a spare FC port on the server and my FC switch...I'm thinking I should run some fibre and see what happens. I'm thinking more and more this is bad hardware somewhere.
Yeah, MRU is most recently used. It will just use the path it last saw data go down; however, if you run another cable you can just change the path in the options, or if you don't want to run another cable, shut down the host and just move it over to the other connection. If you get the same result, you can move the port over on the FC switch. What you want to see is whether both HBAs / paths show the same behavior. Another option is to get the storage vendor on the line and see if they can pull any more logs.
Let me know how it plays out,
So I added the second path (which is just the second port on the same HBA, going through the same FC switch) and swapped a few LUNs from MRU to RR. DAVG on the newly connected vmhba3 went through the roof, far higher than on vmhba2 (at times > 2000).
I've since switched back to MRU and all the traffic is on vmhba2 (again). I'm now getting the below repeatedly in vmkernel.log, along with e-mails that the various paths on vmhba3 are failing.
Man, this is frustrating. Thanks for holding my hand through it all.
2014-06-16T18:31:09.481Z cpu16:428153)WARNING: ScsiPath: 6144: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba3:C0:T0:L5
2014-06-16T18:31:09.516Z cpu14:428417)<6>qla2xxx 0000:44:00.1: ZIO mode 6 enabled; timer delay (100 us).
2014-06-16T18:31:19.483Z cpu26:8245)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:647: Path "vmhba3:C0:T0:L6" (DOWN) command 0xa3 failed with status Timeout. H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-06-16T18:31:19.484Z cpu2:8740)<3> rport-8:0-0: blocked FC remote port time out: saving binding
2014-06-16T18:31:29.486Z cpu26:8245)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:647: Path "vmhba3:C0:T0:L8" (DOWN) command 0xa3 failed with status Timeout. H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-06-16T18:31:33.551Z cpu14:8206)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x41240574c740, 9481) to dev "naa.6c81f660ea0af8001b05d51807353714" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2014-06-16T18:31:33.551Z cpu14:8206)ScsiDeviceIO: 2331: Cmd(0x41240574c740) 0x85, CmdSN 0x2eb from world 9481 to dev "naa.6c81f660ea0af8001b05d51807353714" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-06-16T18:31:33.551Z cpu14:8206)ScsiDeviceIO: 2331: Cmd(0x41240574c740) 0x4d, CmdSN 0x2ec from world 9481 to dev "naa.6c81f660ea0af8001b05d51807353714" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-06-16T18:31:33.551Z cpu14:8206)ScsiDeviceIO: 2331: Cmd(0x41240574c740) 0x1a, CmdSN 0x2ed from world 9481 to dev "naa.6c81f660ea0af8001b05d51807353714" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2014-06-16T18:31:39.519Z cpu16:428417)<4>qla2xxx 0000:44:00.1: Failed mailbox send register test
2014-06-16T18:31:39.519Z cpu16:428417)<6>qla2xxx 0000:44:00.1: qla2x00_abort_isp: **** FAILED ****
2014-06-16T18:31:39.519Z cpu31:8754)ScsiScan: 512: Path 'vmhba3:C0:T0:L4': Failed to read VPD Device id page: Timeout
2014-06-16T18:31:39.952Z cpu16:8883)<6>qla2xxx 0000:44:00.1: Performing ISP error recovery - ha= 0x41002480f030.
2014-06-16T18:31:39.987Z cpu16:8883)<6>qla2xxx 0000:44:00.1: ZIO mode 6 enabled; timer delay (100 us).
2014-06-16T18:31:40.009Z cpu16:8883)<6>qla2xxx 0000:44:00.1: FW: Loading via request-firmware...
2014-06-16T18:31:40.210Z cpu16:8883)<4>qla2xxx 0000:44:00.1: Firmware dump previously allocated.
2014-06-16T18:31:41.322Z cpu16:8883)<6>qla2xxx 0000:44:00.1: LOOP UP detected (4 Gbps).
2014-06-16T18:31:41.388Z cpu16:8883)vlan_id: 0x0
2014-06-16T18:31:41.388Z cpu16:8883)vn_port_mac_address: 00:00:00:00:00:00
2014-06-16T18:31:42.650Z cpu18:8883)<3>qla2xxx 0000:44:00.1: Unable to read FCP priority data.
Yeah, I would get VMware and your storage vendor back on the phone, as you have done a bunch of testing and have some new results and information to share with them. They should hopefully be able to find the root of the problem now. Also, you can try to get a three-way call going with them to stop the finger-pointing.
I found this article, but I'm not sure if it's the same issue you're having:
Also, this is probably just because of the new path testing, I would assume.