VMware Cloud Community
Candell
Contributor

High iowait stalling ESX host

Hi all!

My company recently deployed a two-node ESX 3.5 cluster in our office in Tokyo. Everything went fine, but soon after we came home I noticed that some VMs were periodically unresponsive. It turned out that only VMs on one of the hosts were affected, and that this host has frequent spikes of 95-100% iowait. By frequent I mean about twice a minute, with each spike lasting 10-20 seconds.
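
To put numbers on spikes like these, iowait can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The following is only a rough sketch: it assumes the common Linux field order (user, nice, system, idle, iowait, ...), which may not match the service console's kernel, and the function names are made up for illustration.

```python
def parse_cpu_line(line):
    """Return the jiffy counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("not the aggregate cpu line")
    return [int(x) for x in fields[1:]]


def read_cpu_counters(path="/proc/stat"):
    """Read the current counters; the aggregate 'cpu' line is the first line."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples.

    Assumption: iowait is the fifth counter (index 4). Only the first
    eight counters are summed so that guest time, which newer kernels
    also fold into user time, is not counted twice.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if len(deltas) < 5 or total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total
```

Calling `read_cpu_counters()` twice a few seconds apart and passing both results to `iowait_percent` gives the figure for that interval; logging it on both hosts makes the spikes directly comparable.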

A reboot fixes the issue, but it soon returns. Of course the servers are installed and configured identically.

Hardware:

2 x HP ProLiant DL380 G5, 8 cores, 32GB RAM.

ESX 3.5 installed on local SAS-disks, mirrored w/ spare.

SAN is an HP MSA2012fc.

Each server has dual HBAs connected to the MSA with full redundancy.

Each server also has 6 NICs, of which 5 are used:

- vmnic0: Service console, 1000 full

- vmnic1: Service console 2, 100 full (crossover cable between hosts)

- vmnic3: VMotion (crossover cable between hosts)

- vmnic4+vmnic5: VM lan-switch

I've tried shutting down processes one after another to see if any of them caused the issue, and I have looked all over the web for help with troubleshooting iowait issues, but I can't find any really good advice. I'm tempted to reinstall the host from scratch, but since it's on the other side of the planet (I'm in Sweden, the ESX cluster is in Japan), it's a bit complicated.

I'd be grateful for any advice you can give me.

Thanks

4 Replies
Gerrit_Lehr
Commander

Well first of all (I don't think that is the issue tho), what is the second service console with a x-over for? That doesn't make any sense at all to me.

Regarding your iowait issues, is /var/log/messages or dmesg reporting any errors? Are any agents running on the hosts? Are the ESX hosts accessing the SAN LUNs?
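
If it helps with going through the logs, here is a minimal sketch of a filter for lines that hint at storage trouble. The keyword list is purely an assumption on my part and not an official list of ESX messages, so adjust it to whatever your storage stack actually logs.

```python
import re

# Assumed, illustrative keywords only; extend or trim as needed.
SUSPECT = re.compile(r"i/o error|scsi|timeout|timed out|abort|reset",
                     re.IGNORECASE)


def suspect_lines(lines, pattern=SUSPECT):
    """Return the lines (without trailing newline) that match the pattern."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

It takes any iterable of lines, so it can be fed an open /var/log/messages file or the captured output of dmesg.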

Kind Regards,

Gerrit Lehr

If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".

Candell
Contributor

The crossover is there to provide the redundancy needed for the service console. If we put both in a physical switch, that switch would be a single point of failure.

Don't really know what to look for in the logs, but nothing looks really evil.

/var/log/messages

Got some of this stuff, but the working node got the same.

Jun 3 17:33:17 jpesx02 modprobe: modprobe: Can't locate module char-major-14

Jun 3 17:33:17 jpesx02 modprobe: modprobe: Can't locate module char-major-14

Jun 3 17:33:17 jpesx02 modprobe: modprobe: Can't locate module block-major-2

Also this, but both nodes got them:

Jun 3 17:33:03 jpesx02 watchdog-openwsmand: PID file /var/run/vmware/watchdog-openwsmand.PID not found
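
Since both nodes log much of the same noise, a quick way to compare them is to strip the timestamp and hostname and list only the messages that appear on one node. This is just a sketch that assumes the classic syslog prefix seen above and a reasonably recent Python on a workstation holding copies of both logs; the function names are made up.

```python
import re

# Matches a classic syslog prefix such as "Jun 3 17:33:17 jpesx02 ".
SYSLOG_PREFIX = re.compile(r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\S+\s+")


def normalize(line):
    """Drop the timestamp and hostname so lines from different hosts compare equal."""
    return SYSLOG_PREFIX.sub("", line.strip())


def only_on_first(lines_a, lines_b):
    """Return the distinct messages present in lines_a but absent from lines_b."""
    seen_b = {normalize(line) for line in lines_b if line.strip()}
    seen_a = {normalize(line) for line in lines_a if line.strip()}
    return sorted(seen_a - seen_b)
```

Passing the bad node's log first and the good node's log second leaves only the messages unique to the bad node, which is a much shorter list to read through.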

dmesg

Nothing special

Agents

The only agents running are the HP SIM agents. The first thing I did was to uninstall them to see if that made any difference. It didn't, and I've since reinstalled them.

SAN

The problem persists even if all VMs are migrated to the other host. I don't know whether there is still some sporadic I/O to the SAN anyway.

I just got our guy in Tokyo to completely disconnect the host from the SAN, but I'm still seeing around 97% iowait.
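
With the SAN out of the picture, one more thing that could be checked is which processes sit in uninterruptible sleep (state D), since on Linux those are typically the tasks waiting on I/O. Below is a minimal sketch that reads `/proc/<pid>/stat`; the function names are hypothetical, and I'm assuming the service console exposes the usual Linux `/proc` layout.

```python
import os


def parse_stat(text):
    """Split '/proc/<pid>/stat' content into (pid, command name, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so search for the last ')' instead of splitting.
    """
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen].strip())
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return (pid, command) pairs for processes currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (IOError, OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the entry is unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Running `blocked_processes()` repeatedly during a spike and noting which names keep showing up should narrow down what is actually waiting on I/O.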

Gerrit_Lehr
Commander

Well, but if the pSwitch fails, how are you going to reach the Service Console anyway? Even worse, if the first service console fails to get the HA heartbeat (which might mean that the VMs are isolated too, if the VM network is connected to the same pSwitch), HA might not kick in because the heartbeat still gets through via the 2nd SC. Maybe you should have a look at this doc: http://www.vmware.com/files/pdf/VMwareHA_twp.pdf

However, did you try to disconnect the second SC to see if maybe that is the issue? It might be that one of the agents does not handle this constellation very well and causes the high iowait. If the logs and dmesg don't report any I/O errors, it should be fine tho.

Kind Regards,

Gerrit Lehr

Candell
Contributor

The crossover is only for heartbeats, not for console connections. I don't really understand your argument for not having it on a crossover cable. We don't want HA to kick in just because we lose the network.

However, I removed the NIC from the 2nd SC just to see if it made any difference, but it didn't.
