VMware Cloud Community
call_percula
Contributor

Scanning software iSCSI HBA leaves host in unstable state

Lots of changes in the environment recently and this little "joy" cropped up. I have open tickets with Juniper, Dell EQL and VMware with everyone more or less pointing the finger at the other.

I'm trying to find out whether anyone else has seen anything like this before and, if so, what they know about dealing with it.

Details:

Dell M620 blades with current firmware on them.

ESXi 5.5.x (I say "x" because I see this problem with a clean install of 1331820 and with 1474528)

(4) 10Gb NICs per host. (2) Dedicated to data and management traffic, (2) dedicated for iSCSI traffic only.

Juniper EX4500 stack running 12.2R5 software, jumbo packets are configured and working, flow control is enabled and no errors are seen on interfaces.
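For anyone who wants to re-verify the jumbo path end to end, a don't-fragment ping has to use a payload of MTU minus the IP and ICMP headers. A minimal sketch of that arithmetic, assuming a 9000-byte MTU (the post does not state the configured value) and an IPv4 header without options:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("MTU is smaller than the IPv4 + ICMP headers")
    return payload


print(max_icmp_payload(9000))  # 8972 (assumed jumbo MTU)
print(max_icmp_payload(1500))  # 1472 (standard Ethernet MTU)
```

The result is the size you would hand to vmkping with the don't-fragment flag set (for example -d -s 8972) from each iSCSI vmknic to the array.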

(2) EQL 6510X and (1) 6510E, all running 6.0.7 firmware. No warnings etc. on these.

I am using vDS on the cluster of (16) blades: one for data and management traffic with several port groups, and one for iSCSI only. On the iSCSI vDS I have (3) port groups. One is for heartbeat only, with both NICs bound and the lowest-numbered vmknic in that port group. Each of the two remaining port groups has one vmknic with one NIC bound.

I have installed the EQL MEM 5-1.2.0.365964 via Update Manager and enabled iSCSI multipath.

Now all of this has worked great without a problem until recently. We lost a controller on an array during a firmware upgrade. That controller is still there because until I resolve this issue I can't replace it without fear of another total outage.

When I scan for new storage (HBA rescan only), the operation times out after about 20 minutes and vCenter loses its connection to the host. I can get on the console and sometimes SSH to the host, but it is VERY unstable; for example, running tail -n 100 /var/log/syslog will lock up the SSH session. Sometimes I can restart the services with services.sh and get the host back to a manageable state; sometimes it takes a power cycle to get it up again. I also had to turn off the USBarbitrator service. Before this issue the service was set to auto-start on boot and was never a problem; after the issue started, the boot would hang at "starting USBarbitrator service" for several hours before timing out and finishing the boot process.
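Since commands on the affected host can hang the session, one workaround while gathering logs is to wrap each remote command in a hard timeout from an admin workstation. This is only a sketch, assuming Python 3.7+ and an ssh client on the workstation; the helper name run_with_timeout and the host name in the usage comment are made up for illustration. If the remote process is stuck in storage I/O, the kill only frees the local session and does nothing for the host itself.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, output).

    finished is False when the command was still running at timeout_s
    seconds and had to be killed locally.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the captured output
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return True, out
    except subprocess.TimeoutExpired:
        proc.kill()  # stop the local child (e.g. the ssh client)
        try:
            # Collect whatever output was produced before the kill.
            out, _ = proc.communicate(timeout=2)
        except subprocess.TimeoutExpired:
            out = ""  # child did not exit promptly; give up on its output
        return False, out or ""


# Hypothetical usage (host name is a placeholder):
# finished, log_tail = run_with_timeout(
#     ["ssh", "root@esx-host-01", "tail -n 100 /var/log/syslog"], 15)
```

A False result for a simple tail is itself a useful data point to attach to the support tickets, along with the time the rescan was started.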

1 Reply
alex_g60
Contributor

We just encountered (almost exactly) this same problem.

Our setups are pretty close. We have 4 Dell R720s connected to a Nimble SAN (iSCSI) via redundant 10Gb links over Extreme Networks switching. Jumbo frames are also on for us. Everything seems to work fine until this one HBA scan.

I haven't upgraded our hosts from 5.1 to 5.5 yet.  However our vCenter (SSO, Inventory, vCenter) is on 5.5 P01 (1476327).

If I were to upgrade the hosts, it would be to that same 5.5 P01 release, build 1474528.


But you are correct, a reboot seems to be the only way to bring the host back to life (which is unacceptable, in my opinion).

If vMotion were possible during this HBA scan issue, it wouldn't be that big of a deal.


I wouldn't think the differing versions between Host and vCenter would have anything to do with it.


Could you please reply once you have any sort of direction on what the possible solution is?  I'm not getting anything concrete from VMware at the moment.  I have yet to contact the storage vendor about this issue.  (You are correct about the finger pointing.)


Thanks for posting this!
