VMware Cloud Community
Dr_No2
Contributor

Servers intermittently appear with "No Heartbeats"

Hello

Whenever we copy large files (100GB-200GB) from one VM to another, or from a physical server to a VM, we lose heartbeats from multiple VMs.

Could the network be causing this or the disk? Something seems not to handle it well.

Where should I look?

6 Replies
athlon_crazy
Virtuoso

While the copy is running, go to your ESX host's "Performance" tab and monitor the performance of both disk and network.
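
If you prefer to capture the same counters from the command line, you can also run esxtop in batch mode on the host during the copy (for example esxtop -b -d 10 -n 60 > copy.csv) and scan the export afterwards. Below is a minimal Python sketch for that; the header keywords are assumptions and may need adjusting to the counter names your ESX version actually emits:

```python
#!/usr/bin/env python
"""Scan an esxtop batch-mode export for the disk and network counters
that peaked highest during the copy, e.g. after running
'esxtop -b -d 10 -n 60 > copy.csv' on the host."""
import csv
import sys

# Header keywords to match -- illustrative assumptions, not the exact
# counter-group names of every ESX release.
KEYWORDS = ("Physical Disk", "Network Port", "Physical Network")
TOP_N = 15  # how many of the highest-peaking counters to print

def scan(path):
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        # keep only columns whose name mentions disk or network
        wanted = {i: name for i, name in enumerate(header)
                  if any(k in name for k in KEYWORDS)}
        peaks = dict.fromkeys(wanted, 0.0)
        for row in reader:
            for i in wanted:
                try:
                    peaks[i] = max(peaks[i], float(row[i]))
                except (ValueError, IndexError):
                    continue  # blank or non-numeric sample
    for i, peak in sorted(peaks.items(), key=lambda kv: -kv[1])[:TOP_N]:
        print("%12.2f  %s" % (peak, wanted[i]))

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "copy.csv")
```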

System Engineer

Zen Systems Sdn Bhd

Malaysia

www.no-x.org

Erik_Zandboer
Expert

Hi,

Because other VMs (which are not part of the copy action, I presume) also lose their heartbeat, I suspect a networking issue. Checking the performance tabs is always a good idea, as already mentioned. The more important question, though, is how your network is set up. Are you sharing production (VM networks) with the service console? Are you using the same physical uplinks? Do you have any form of load balancing for any (or all) network segments? Do you have any non-default settings inside your vSwitches (like traffic shaping)? All these parameters will give more insight into any possible issues.
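
If it helps to collect those answers in one place, a short pyVmomi sketch like the one below can dump every vSwitch with its physical uplinks and portgroups. The host name and credentials are placeholders, and pyVmomi is assumed to be installed; it is only one way of gathering this information:

```python
#!/usr/bin/env python
"""List every vSwitch, its physical uplinks and its portgroups, to show
which traffic types share which pNICs. Host name and credentials below
are placeholders."""
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def dump_network(host_name, user, pwd):
    ctx = ssl._create_unverified_context()   # lab use only, skips cert checks
    si = SmartConnect(host=host_name, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            net = host.config.network
            print("== %s ==" % host.name)
            for vsw in net.vswitch:
                # vsw.pnic holds pNIC keys such as 'key-vim.host.PhysicalNic-vmnic0'
                print("vSwitch %s, uplinks: %s"
                      % (vsw.name, ", ".join(vsw.pnic or [])))
                for pg in net.portgroup:
                    if pg.spec.vswitchName == vsw.name:
                        print("   portgroup %s (VLAN %s)"
                              % (pg.spec.name, pg.spec.vlanId))
    finally:
        Disconnect(si)

if __name__ == "__main__":
    dump_network("esx01.example.com", "root", "password")
```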

Visit my blog at http://erikzandboer.wordpress.com

Visit my blog at http://www.vmdamentals.com
Dr_No2
Contributor

What was weird is that the system was still acting up after the file copy. I just copied a 400GB file with no problem. The problem may be SQL Server running on that VM.

The attached screenshot shows gaps in the graph where the performance tab could not read data.

Could high disk usage cause this?

Erik_Zandboer
Expert

Hi,

Performance data is UDP based, meaning that if VirtualCenter is not ready to receive the packets (for whatever reason), the performance data will simply be missing. Something is saturating your system: disk I/O, CPU cycles, or maybe even network I/O. If you want us to look further, we are going to need to know things like your network setup, the storage used, and of course how many KB/sec is being pushed to which storage. It is always hard to tell from a distance; I would look for high values in the performance charts first.
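
One way to see exactly when the samples went missing is to export the chart data to CSV and check the timestamps for gaps. A small sketch, assuming the sample time is in the first column and samples arrive every 20 seconds (the real-time interval); adjust TIME_FORMAT to whatever your export actually uses:

```python
#!/usr/bin/env python
"""Find gaps in an exported performance-chart CSV.
Assumes the first column holds the sample timestamp and that samples
should arrive every 20 seconds; both are assumptions to adjust."""
import csv
import sys
from datetime import datetime

INTERVAL = 20                        # expected seconds between samples
TIME_FORMAT = "%m/%d/%Y %H:%M:%S"    # adjust to the exported date format

def find_gaps(path):
    times = []
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader)                 # skip the header row
        for row in reader:
            try:
                times.append(datetime.strptime(row[0], TIME_FORMAT))
            except (ValueError, IndexError):
                continue             # skip rows without a parseable timestamp
    for prev, cur in zip(times, times[1:]):
        missing = (cur - prev).total_seconds() / INTERVAL - 1
        if missing >= 1:
            print("%d sample(s) missing between %s and %s"
                  % (missing, prev, cur))

if __name__ == "__main__":
    find_gaps(sys.argv[1] if len(sys.argv) > 1 else "chart.csv")
```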

Visit my blog at http://erikzandboer.wordpress.com

Visit my blog at http://www.vmdamentals.com
Dr_No2
Contributor

I am using an EVA4400 with 48x 300GB FC disks on 2x 24-port 4Gb FC switches. The hosts are HP BL680c G5 blade servers with 4x quad-core 2.4GHz CPUs and 64GB of RAM.

All ESX hosts are configured on the same VLAN with 4x network cards: three assigned to VM traffic (shared with the service console) and one to VMotion. The default network configuration is applied on the vSwitches.

Erik_Zandboer
Expert

Now we're getting somewhere. I understand you use one big single LAN to drive everything? If the vSwitches are left at their defaults, they will load-balance on a round-robin principle (based on port ID). This means that each port on the vSwitch is put on "the next" physical uplink, so VMs might share a physical connection with the service console or even the VMotion network. You say you assigned three NICs to "traffic" and one to VMotion. Are you using separate vSwitches for this, or did you define active/standby/unused NICs for each portgroup? I think your problem lies here. Troubleshooting is hard and the symptoms are vague, because you never really know which network I/O traverses which NIC, and thus who might be bothering whom. For example, if you start a large file copy on a VM, that VM might be sharing its NIC with the SC of the ESX node, which could cause "holes" in the performance graphs. If you then repeat the same copy action with another VM, it might all be fine (because that VM just happens to use a different uplink than the SC).

I would try to keep VM traffic, SC and VMotion on their own physical uplinks. When using 4 NICs, I often configure one big vSwitch, with two uplinks used for VM traffic (and the other two as standby), the third NIC for SC (with all other NICs as standby), and VMotion on the fourth NIC with the SC NIC as standby, or sometimes with no standby at all (all unused). I configure all ports to be a trunk which can carry all VLANs (and I separate SC, VMotion and VM traffic onto different VLANs).

This setup will get you:

  1. 2Gbit for VM traffic, with failover to the other NICs, and no other traffic able to get in between the VM traffic (production must go on), except SC as a last resort (think about HA!);

  2. A separate uplink reserved for SC (no more holes in performance graphs), with failover to the VMotion NIC or even the VM-traffic NICs if you prefer;

  3. A separate VMotion uplink, where VMotion cannot get in between anything (VMotion stops working if the NIC fails, or possibly fails over to the SC NIC if you prefer).

As you can see, this is VERY flexible and very reliable. You have the freedom to assign any NIC(s) to any portgroup, specify the failover NICs and their order, and specify which NICs should never be used for failover. And all of this per portgroup!

Addendum: I always set the vSwitch default to the VM-traffic settings, and create exceptions for SC and VMotion. That way, every VM-traffic VLAN you create will follow the vSwitch default. That saves you a lot of configuring when you have 20 VLANs for VM traffic.
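
For reference, the per-portgroup active/standby override described above can also be scripted rather than clicked through. Here is a pyVmomi sketch, assuming you connect straight to the host; the portgroup, vSwitch and vmnic names are illustrative placeholders, not taken from your environment:

```python
#!/usr/bin/env python
"""Set an explicit active/standby NIC order on one portgroup, overriding
the vSwitch default. Portgroup, vSwitch and vmnic names are placeholders."""
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def set_nic_order(host_name, user, pwd, pg_name, vswitch, active, standby):
    ctx = ssl._create_unverified_context()   # lab use only, skips cert checks
    si = SmartConnect(host=host_name, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        # assumes a direct connection to a single ESX host
        host = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True).view[0]
        net_sys = host.configManager.networkSystem

        # look up the existing portgroup so we keep its VLAN ID
        pg = next(p for p in host.config.network.portgroup
                  if p.spec.name == pg_name)

        policy = vim.host.NetworkPolicy()
        policy.nicTeaming = vim.host.NetworkPolicy.NicTeamingPolicy()
        policy.nicTeaming.policy = "loadbalance_srcid"   # default port-ID balancing
        policy.nicTeaming.nicOrder = vim.host.NetworkPolicy.NicOrderPolicy()
        policy.nicTeaming.nicOrder.activeNic = active
        policy.nicTeaming.nicOrder.standbyNic = standby

        spec = vim.host.PortGroup.Specification(
            name=pg_name, vlanId=pg.spec.vlanId,
            vswitchName=vswitch, policy=policy)
        net_sys.UpdatePortGroup(pgName=pg_name, portgrp=spec)
    finally:
        Disconnect(si)

if __name__ == "__main__":
    # VM traffic active on vmnic0/vmnic1, one NIC kept as a last-resort standby
    set_nic_order("esx01.example.com", "root", "password",
                  "VM Network", "vSwitch0",
                  active=["vmnic0", "vmnic1"], standby=["vmnic2"])
```

Run it once per portgroup (VM traffic, Service Console, VMotion) with the active and standby lists laid out as in the scheme above.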

Visit my blog at http://erikzandboer.wordpress.com

Visit my blog at http://www.vmdamentals.com