VMware Cloud Community
addy4488
Contributor

vCenter connection to host dies completely (ESXi 4.1)

We have some new HP BL460c servers and have recently installed the HP build of ESXi 4.1, as it contains the Emulex NIC drivers:

https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=HPVM06

3 hosts have been configured within a cluster to talk to our HP P4000 iSCSI SAN, all via these Emulex Oneconnect hardware storage adapters

We have HA enabled on the cluster

What I'm finding from the outset is that as I roll out VMs to the hosts and hit them with migration/deploy operations, something will eventually time out and the host will drop its connection to vCenter.

However, I can ILO to the host and can ping it from my laptop, but any VMs running on that host die completely, and HA is not recognising that they are down, so it does not try to restart them on another host. Not that that would soften the blow much; ultimately those machines still go off, and to the end user that's not the best. I digress.....

I've been scouring the logs and the internet for solutions and can't find an exact fix, certainly not one that pertains to ESXi.

I found the following, which relates to ESX, but is the closest I can get to what is actually happening

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101060...

Further to my earlier comments about getting on the ILO: I can get on and restart the management agents (although this takes a few minutes), but if I try to restart the host, it just hangs with the message "restart in progress".

I connected via SSH and did a services.sh restart, and it complains about the vmware-aam service not being able to start properly, as well as giving me some other errors. (I apologise, as I've since cold booted and the problem isn't apparent right now; I'll post the output if it reoccurs, but I'm mainly hoping somebody can spot something obvious.)
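
As an aside, rather than running the full services.sh each time, you can restart just the two agents vCenter actually talks to, which is quicker when the host is struggling. If I've got the 4.1 paths right, it's:

# Restart the host agent and the vCenter agent individually
/etc/init.d/hostd restart
/etc/init.d/vpxa restart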

It's looking increasingly like either a) something inside the HP build, b) some kind of bug with the newer hardware (so related to a, in a way), or c) some storage-related issue that's causing the host to panic and bomb out.

When I do a ps | grep hostd, there are instances I cannot kill off with kill -9. Could this be related?

The only way to get it back up is to cold boot the host

Any advice or pointers would be great, and I'll do my best to see it through to a resolution.

9 Replies
kjb007
Immortal

I've found as well that a storage/VMFS-related issue will typically kick an ESX host off vCenter, as the host dedicates quite a lot of resources to bringing that storage back online. There should be evidence of this in the messages log on the host itself. My issues were on HP blades as well, but over FC rather than iSCSI; a reset on that bus would usually let us connect again without a full reboot, though sometimes a reboot was needed. You should see something like an 'All Paths Down' message, or you can search for All Paths Down or APD for more info.
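
Something along these lines from the host's shell should show whether that's what happened (ESXi 4.x keeps everything in the combined messages log):

# Look for path-loss messages in the host's local log
grep -i "all paths down" /var/log/messages
# And more generally any storage errors against the Emulex adapters
grep -i "vmhba" /var/log/messages | grep -i "error"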

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
logiboy123
Expert

We had similar issues with some Dev hosts and a Dell SAN environment. Eventually it was resolved with a firmware update on the Raid Controllers for the SAN.

I suspect that commands to the iSCSI network were getting queued up too much; when this happens, host memory starts filling up with commands waiting to be executed, and eventually pretty much everything dies. The only fix was to reboot the hosts.
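
If you want to see whether commands are backing up on your hosts before they fall over, esxtop from an SSH session is the quickest check (field names from memory, so worth verifying on your build):

# From an SSH session on the host
esxtop
#  'd' = disk adapter view - watch DAVG/cmd and KAVG/cmd for vmhba0/vmhba1
#  'u' = disk device view  - watch QUED for your P4000 LUNs
# High KAVG, or a QUED figure that never drains, points at commands queuing up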

One troubleshooting step could be to try ESXi 5.0 against your SAN and see what sort of result you get; if it is the same, then it is probably an issue with your SAN or your host-to-SAN networking.

Regards,

Paul

addy4488
Contributor

Hi

I checked this morning, and a particular host, VM10, which has been sitting idle with new servers on it (no user activity, in fact no activity at all), has dropped offline again. Well, I say offline, but really it just cannot connect to vCenter.

I can't export the logs from vCenter because it can't see the host, so I'm trying to see what I can get manually from it; please bear with me.

Having not done this before, but going from other servers, the Emulex hardware appears to use SCSI adapters vmhba0 and vmhba1

I did a rescan of these using esxcfg-rescan vmhba0, which failed with "Unable to scan adapters for VMFS".

I then, as per another article, ran vmkfstools -V, and this too failed, with the errors shown below from the messages log:

vmissue1.JPG
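
For anyone reading this later, the commands in question were roughly as follows (plus listing the adapters first to confirm the vmhba numbering):

# List the storage adapters the host can see
esxcfg-scsidevs -a

# Rescan the Emulex adapters - this failed with "Unable to scan adapters for VMFS"
esxcfg-rescan vmhba0
esxcfg-rescan vmhba1

# Re-read the VMFS volumes - this is the step that threw the errors above
vmkfstools -V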

vCenter first shows VM10 as being unable to synchronize at 00:03, and keeps displaying that message right up to 04:17, at which point it marks it as not responding and shows the host as disconnected; no events are logged for that host after that.

Unfortunately, my messages log appears full and only goes back as far as the last 100 minutes or so, and it is now 06:13

There was nothing relating to APD (all paths down)
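
Since the local log only holds the last hour or two, I'm going to try pointing syslog at a datastore so there's more history to search next time. If I remember rightly, the advanced option on 4.1 is Syslog.Local.DatastorePath (worth double-checking the exact name under Configuration > Advanced Settings); from the shell it would be something like this, with the datastore name being just an example:

# Keep the host's syslog on a datastore instead of the small in-memory log
# (datastore name is an example - substitute your own)
vim-cmd hostsvc/advopt/update Syslog.Local.DatastorePath string "[datastore1] logs/vm10-messages"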

Any further suggestions?

To pick up on the ESXi 5.0 suggestion, we do have that as an option, but no vCenter 5 yet; that's to do when the opportunity arises.

None of our GL360 physical hosts have this issue, but they use software iSCSI, not the hardware adapters that the blades use.

I should point out that the main difference in networking between the blades and the other physical servers is that we're doing some 10Gb upgrades. The iSCSI SAN is going to 10Gb but is currently uplinked to the core at 1Gb only, whereas the blades are uplinked to the core at 10Gb. Could this be some kind of bottleneck?
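
For what it's worth, this is an easy way to confirm from the host what each uplink has actually negotiated (vmnic numbering will differ per host):

# Show the physical NICs and the speed/duplex each one has negotiated
esxcfg-nics -l

# Show which vmnics and vmkernel ports sit on the iSCSI vSwitch
esxcfg-vswitch -l
esxcfg-vmknic -l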

addy4488
Contributor

Just to add to this, these are the results from running services.sh restart via SSH

Working host

vmissue2.JPG

Not working (VM10)

vmissue3.JPG

VasooV
Enthusiast

Hi, I had similar issues in the past with the vmkernel and NFS when the network settings were set to auto. Removing inter-VLAN routing and forcing all the NICs and switch ports to 1000/full (including the Gb2E ports connected to the blade servers) resolved the dropouts.
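
If you want to try forcing it from the host side as well, it's one command per NIC (vmnic2 below is just an example, and obviously leave any 10Gb uplinks alone):

# Force a NIC to 1000/full instead of auto-negotiate (repeat per vmnic)
esxcfg-nics -s 1000 -d full vmnic2

# Confirm what was applied
esxcfg-nics -l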

http://veerapen.blogspot.com
kjb007
Immortal

Looking at the log output, you're getting permission denied when you try to mount your devices.  Can you check the logs on your storage side?
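
On the ESXi side, something like this will pull those entries back out so you can line the timestamps up against the P4000 logs (adjust the search string to whatever the screenshot actually shows):

# Pull the mount/permission errors out of the host log
grep -i "permission denied" /var/log/messages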

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
addy4488
Contributor

Hi again

After some digging around and poring over logs, we couldn't find an exact solution to this.

Our presumption, at this stage, is that it is some kind of bug involving hardware iSCSI, the BL460c servers and/or the Emulex OneConnect CNAs.

We've rebuilt each of the hosts and reconfigured to use software iSCSI and will continue to monitor how they perform.
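
For anyone wanting to do the same, the basic software iSCSI setup on a 4.1 host is along these lines (the vmk/vmhba numbers and the target address below are just examples, not our actual values):

# Enable the software iSCSI initiator
esxcfg-swiscsi -e

# Bind the iSCSI VMkernel port to the software initiator
# (check esxcfg-vmknic -l and esxcfg-scsidevs -a for your own numbering)
esxcli swiscsi nic add -n vmk1 -d vmhba33

# Point the initiator at the P4000 cluster VIP and rescan
vmkiscsi-tool -D -a 192.168.1.10 vmhba33
esxcfg-rescan vmhba33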

Yes, this comes with an additional performance overhead, but with projects looming it buys some time, especially if it works. I'll advise once I know more, and it might help somebody else tackle a similar problem.

Whether or not it's fixed in ESXi 5.0 (which we're planning on moving to) remains to be seen, but we can prepare for that a little better.

Cheers

addy4488
Contributor

Oh, here's another instance of a similar issue, but this guy went down a different route in order to stick with the hardware iSCSI:

http://communities.vmware.com/thread/325255?tstart=0

addy4488
Contributor

The permission denied errors pertain to the fact that we have blades booting from 4GB flash cards.

When scanning the Emulex OneConnect adapters, vmhba0 and vmhba1, it references partitions on the flash card (vmhba32) that it cannot read because they are a different file system.
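
For reference, listing the devices and mappings makes the split obvious; the flash card sits behind vmhba32, separate from the SAN LUNs:

# Compact list of every device the host sees and the adapter it sits behind
esxcfg-scsidevs -c

# And the VMFS volumes/partitions the host can actually mount
esxcfg-scsidevs -m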
