We have some new HP BL460c servers, and have recently installed the HP build of ESXi 4.1 because it contains the Emulex NIC drivers:
https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=HPVM06
3 hosts have been configured within a cluster to talk to our HP P4000 iSCSI SAN, all via these Emulex OneConnect hardware storage adapters.
We have HA enabled on the cluster
What I'm finding, from the outset, is that as I roll out VMs to the hosts and hit them with migration/deploy operations, something will eventually time out and the host will drop its connection to vCenter.
However, I can iLO to the host and can ping it from my laptop, but any VMs running on that host will die completely, and HA is not recognising that they are down, so it does not try to boot them up on another host. Not that it would soften the blow; ultimately, those machines still go off, and to the end user that's not the best. I digress...
I've been scouring the logs, and the internet, for solutions, and can't seem to find an exact fix, certainly not one that pertains to ESXi.
I found the following, which relates to ESX, but it's the closest I can get to what is actually happening:
Further on from my earlier comments about getting on via iLO: I can get on and restart the management agents (although this takes a few minutes), but if I try to restart the host, it will just die with the message "restart in progress".
I connected via SSH and did a services.sh restart, and it complains about the vmware-aam service not being able to start properly, as well as giving me some other errors. (I apologise, as I've since cold-booted and the problem isn't apparent right now; I'll post them if it reoccurs, but I'm mainly hoping for somebody to spot something obvious.)
It's looking increasingly like either a) something inside the HP build, b) some kind of bug with the newer hardware (so related to a) in a way), or c) some storage-related issue that's causing the host to panic and bomb out.
When I do a ps | grep hostd, there are instances I cannot kill off with kill -9 - could this be related?
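For reference, a quick way to check whether those unkillable instances are stuck in uninterruptible sleep is to read the process state out of /proc. This is a hypothetical sketch against a standard /proc layout; the busybox shell on ESXi 4.x is more limited, where repeated ps | grep hostd checks after each kill -9 serve the same purpose:

```shell
# Sketch: list PIDs stuck in "D" (uninterruptible sleep) state.
# A process that survives kill -9 is usually blocked in the kernel
# on I/O, which points at the storage path rather than hostd itself.
stuck_pids() {
    for pid in /proc/[0-9]*; do
        # field 3 of /proc/<pid>/stat is the process state
        # (simple sketch; a process name containing spaces would shift the field)
        state=$(awk '{print $3}' "$pid/stat" 2>/dev/null)
        [ "$state" = "D" ] && basename "$pid"
    done
    return 0
}
stuck_pids
```

If hostd workers show up here repeatedly, the hang is almost certainly below hostd, in the adapter or SAN path.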
The only way to get it back up is to cold boot the host
Any advice or pointers would be great, and I'll do my best to see it through to a resolution.
I've found as well that a storage/VMFS-related issue will typically kick an ESX host off vCenter, as the host dedicates quite a lot of resources to bringing that storage back online. There should be evidence of this in the messages log on the host itself. My issues were on HP blades as well, but over FC as opposed to iSCSI; a reset on that bus would usually allow us to connect again without a full reboot, though sometimes a reboot was needed. You should see something related to an 'All Paths Down' message, or you can search for All Paths Down or APD for more info.
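A quick way to check for that is to grep the host logs directly. A sketch, with the log locations as they are on ESXi 4.x; the directory argument is only there so it can be pointed at a copy of the logs:

```shell
# Sketch: search the host logs for 'All Paths Down' evidence.
# On the host itself the default /var/log is what you want;
# pass another directory as $1 to search a saved copy instead.
check_apd() {
    logdir=${1:-/var/log}
    grep -il "all paths down" "$logdir"/messages "$logdir"/vmkernel 2>/dev/null \
        || echo "no APD entries found"
}
check_apd
```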
-KjB
We had similar issues with some dev hosts and a Dell SAN environment. Eventually it was resolved with a firmware update on the RAID controllers for the SAN.
I suspect that commands to the iSCSI network were getting queued up too much; when this happens, the ESXi memory starts filling up with commands waiting to be executed, eventually exhausting the memory on the host and causing pretty much everything to die. The only fix was to reboot the hosts.
One troubleshooting move could be to try ESXi 5.0 against your SAN to see what sort of result you get; if it is the same, then it is probably an issue with your SAN or your host-to-SAN networking.
Regards,
Paul
Hi
I've checked this morning, and a particular host, VM10, which has been sitting idle with new servers on it - no user activity, in fact no activity at all - has dropped offline again. Well, I say offline, but it just cannot connect to vCenter.
I can't export the logs from vCenter because it can't see it, so I'm trying to see what I can get manually from it, please bear with me.
Having not done this before, but going from other servers, the Emulex hardware appears to use SCSI adapters vmhba0 and vmhba1
I did a rescan of these using esxcfg-rescan vmhba0, which failed with 'Unable to scan adapters for VMFS'.
I then, as per another article, ran vmkfstools -V, and this too failed, with errors as shown below from the messages log.
vCenter first shows VM10 as being unable to synchronise at 00:03 and keeps displaying that message right up to 04:17, at which point it marks it as not responding and shows the host as disconnected; no events are logged for that host after that.
Unfortunately, my messages log appears full and only goes back as far as the last 100 minutes or so, and it is now 06:13
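Since the logs rotate away so quickly (and on a flash-booted host they are typically held in memory and lost on a cold boot anyway), I've started copying them off to a datastore as soon as a host misbehaves. A sketch; "datastore1" is a placeholder for one of your VMFS volume names, and the arguments exist only so the function can be exercised elsewhere:

```shell
# Sketch: copy the volatile host logs to persistent VMFS storage
# before they rotate away or a cold boot wipes them.
save_logs() {
    src=${1:-/var/log}
    dst=${2:-/vmfs/volumes/datastore1}   # placeholder datastore name
    stamp=$(date +%Y%m%d-%H%M)
    for f in messages vmkernel; do
        [ -f "$src/$f" ] && cp "$src/$f" "$dst/$f-$stamp.log"
    done
    return 0
}
save_logs
```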
There was nothing relating to APD (all paths down)
Any further suggestions?
To pick up on the ESXi 5.0 suggestion, we do have that as an option, but no vCenter 5 yet; that's to be done when the opportunity arises.
None of our GL360 physical hosts have this issue, but they use software iSCSI, not the hardware adapters that the blades use.
I should point out that the main difference, networking-wise, between the blades and our other physical servers is that we're doing some 10Gb upgrades. The iSCSI SAN is going to 10Gb, but it is currently uplinked to the core at 1Gb only. The blades, however, are uplinked to the core at 10Gb - could this be some kind of bottleneck?
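To put a rough number on that: with the SAN still on a 1Gb uplink, the three blades could in theory offer far more iSCSI traffic than the SAN side can carry. Back-of-envelope only, using the figures from the posts above; it says nothing about actual load:

```shell
# Back-of-envelope oversubscription check: three hosts with 10Gb
# uplinks funnelling into a single 1Gb SAN-side uplink.
hosts=3
host_gbps=10
san_gbps=1
ratio=$(( hosts * host_gbps / san_gbps ))
echo "worst-case oversubscription: ${ratio}:1"
# prints: worst-case oversubscription: 30:1
```

A ratio like that doesn't prove a bottleneck on its own, but alongside storage timeouts it would make the drop/pause counters on that 1Gb uplink worth checking.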
Just to add to this, these are the results from running services.sh restart via SSH:
Working host
Not working (VM10)
Hi, I had similar issues in the past with the VMkernel and NFS when the network settings were set to auto. Removing inter-VLAN routing and forcing all the NICs and switch ports to 1000/full (including the GbE2 ports connected to the blade servers) resolved the dropouts.
Looking at the log output, you're getting 'permission denied' when you try to mount your devices. Can you check the logs on your storage side?
-KjB
Hi again
After some digging around, and poring over logs, we couldn't find an exact solution to this.
Our presumption, at this stage, is that it is some kind of bug involving hardware iSCSI, the BL460c servers and/or the Emulex OneConnect CNAs.
We've rebuilt each of the hosts and reconfigured to use software iSCSI and will continue to monitor how they perform.
Yes, this comes with an additional performance overhead, but with projects looming, it buys some time, especially if it works. I'll advise once I know more, and it might help somebody else tackle a similar problem.
Whether or not it's fixed in ESXi 5.0 (which we're planning on moving to) remains to be seen, but we can prepare for that a little better.
Cheers
Oh, here was another instance of a similar issue, but this guy went down a different route in order to stick with the hardware iSCSI
The 'permission denied' errors pertain to the fact that we have blades with 4GB flash cards that they boot from.
When scanning the Emulex OneConnect adapters, vmhba0 and vmhba1, it references partitions on the flash card, vmhba32, that it cannot read due to them being different file systems.