VMware Cloud Community
NickDaGeekUK
Enthusiast

ESXi 6.0, VCSA 6.5, Veeam 11, Separate Backup Network Configuration: how to?

Hi Everyone,

I have a complex problem that, with multiple levels of software and hardware involved, is a bit of a puzzle.

Background.

I have inherited from my predecessor a pair of HPE servers running ESXi as the physical hosts for our virtual machine environment. They are managed by a VMware vCenter Server Appliance (VCSA), which is itself a VM residing on one of the two physical hosts.

Both physical hosts have a pair of four-port 1GbE Network Interface Cards (NICs). A Management network, a VM LAN network and a Backup network are configured on each host. We do not have Enterprise Plus licensing, so Distributed Switches are not an option for us, I believe.

I have a VM running Windows Server on the second of the two physical hosts with Veeam Backup & Replication installed, and a proxy on another VM on the same physical host. During backups we had two issues:

1. CPU and Datastore usage on the second physical host was hitting maximum.
2. Backup on the first physical host was slow due to there being no proxy on the same physical host.

As a result, in collaboration with Veeam Support, I split and reconfigured the backup jobs so that the proxy assigned to a job didn't try to back itself up (which helped with the extreme duration of some backups), throttled the datastore to prevent high latency, and configured a proxy on the first physical host to handle backup jobs for the VMs on that host.

All seemed fine until I noticed that internet traffic and other software on the new proxy VM on the first physical host was now using the Backup LAN, and DNS resolution of its host name was returning the wrong IP. This caused a web server and a FlexLM licence server to bind to the wrong NIC and IP address.

Much research later, I discovered that all of the multihomed VMs had a similar problem: each had two IP addresses registered in DNS. I researched best practice, found our configuration wasn't compliant, and so I:

  1. removed the DNS servers and default gateway from the backup LAN NIC on each VM,
  2. put in a persistent static route to the gateway for the backup subnet,
  3. added hosts file entries, with both FQDN and short name, on each of the proxy VMs to force the DNS name of each proxy to resolve to the Backup LAN IP address assigned to it (a rough sketch of these Windows-side changes follows this list), and
  4. ended up with what is now a textbook multihomed configuration according to KBs from both Microsoft and Veeam.
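In practice, the Windows-side changes on each proxy VM boiled down to something like the following sketch; the subnet, gateway, addresses and host names here are placeholders rather than our real values:

    rem Persistent static route (elevated prompt): send traffic for the backup
    rem infrastructure network 10.10.30.0/24 via the backup NIC's gateway 10.10.20.254
    route -p ADD 10.10.30.0 MASK 255.255.255.0 10.10.20.254

plus hosts file entries on each proxy along these lines:

    # appended to C:\Windows\System32\drivers\etc\hosts
    10.10.20.11  veeamproxy01.example.local  veeamproxy01
    10.10.20.12  veeamproxy02.example.local  veeamproxy02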

This solved the duplicate DNS entries pointing to separate IP addresses, but it killed the backup proxy on the first physical host from Veeam's point of view.

Low-level ping and pathping between the Veeam server and the proxy on the second physical host is fine and lightning fast, because it is handled internally via what I assume to be the hidden 10GbE virtual switch between VMs in ESXi. I say "hidden" because in Hyper-V it's visible in the GUI. The problem seems to be with the VM proxy on the first physical host.

There is a significant delay when resolving the host name of either the Veeam server or the proxy on the second physical host, but eventually it does resolve to the IP on the backup network adapter. It claims (I am not certain this is true) to use the default route to the gateway on the backup network to reach the VM on the second physical host, and there is no packet loss.

The fly in the ointment is that it's not stable (disconnections from Veeam and configuration warnings about the ESXi hosts), and there is a significant delay on hop resolution using pathping or tracert. So I looked deeper with the help of Veeam and found we still have issues connecting to the proxy on the first physical host, and issues with DNS name resolution of the ESXi hosts.

Researching the configuration of the ESXi hosts and the network, there is only one default gateway showing in the TCP/IP stacks, and it points to the management subnet.

There are three virtual switches: 

Management - containing two networks (port groups) and one vmkernel adapter with an IP on the management subnet, going to a single physical NIC:

  - Management Network (vmkernel adapter)
  - Management LAN (two VMs attached)

VM - containing two networks but no vmkernel adapters, going to a team of four physical NICs in a Cisco LAG at the switch:

  - VM LAN (no vmkernel)
  - VoIP LAN (no vmkernel)

Backup - containing two networks and a single vmkernel adapter with an IP on the Backup subnet, going to two physical NICs in a Cisco LAG at the switch:

  - Backup LAN (vmkernel adapter)
  - Backup (a single VM attached, no vmkernel)

What worries me is that there is only one default gateway for everything, and it is on the Management vmkernel, pointing to the gateway of the management interface on the network switch.

In the default TCP/IP stack there is a routing table.

It has three entries: the two directly connected IP subnets (management and backup), both shown with a gateway of 0.0.0.0, and finally a 0.0.0.0 (default) network pointing to the default gateway on the management subnet.
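For reference, this is roughly what that table looks like from the ESXi Shell; the addresses below are placeholders rather than our real subnets:

    # routing table of the default TCP/IP stack (vmkernel traffic only)
    esxcli network ip route ipv4 list
    #  Network      Netmask        Gateway      Interface  Source
    #  -----------  -------------  -----------  ---------  ------
    #  default      0.0.0.0        192.168.1.1  vmk0       MANUAL   <- management gateway
    #  192.168.1.0  255.255.255.0  0.0.0.0      vmk0       MANUAL   <- management subnet
    #  192.168.2.0  255.255.255.0  0.0.0.0      vmk1       MANUAL   <- backup subnet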

Questions

My limited knowledge of IP networking tells me that, with that route table, any packet from either IP subnet on any of the physical NICs that needs a gateway has to be sent to that single default gateway IP on the management subnet.

I suspect this means that, despite separating the traffic inside ESXi into networks and virtual switches and across two separate four-port physical NICs, we are actually sending all the traffic to a single gateway on the switch.

Or does the ESXi virtual switch forward based on the IP address, default gateway and/or static routes of the guest OS inside the VM? i.e. do the headers of the IP packets sent by the server determine the path through the virtual switch? If that is so, then what purpose does the IP route table in the default TCP/IP stack serve?

Can anyone supply an answer to this?

If it is the case that the default route inside the ESXi default TCP/IP stack takes precedence, then can anyone show me how to correctly set up the TCP/IP stacks, which ESXi components (vmkernel adapters and standard virtual switches) I will need, and how to configure them, please? I have done a lot of reading of docs and KBs and sadly am none the wiser. The one thing I have found for certain is that custom TCP/IP stacks seem to require CLI access to the host. Also, if required, what do I need to do on the physical Cisco switch the hosts are connected to, please? Ideally I need to get the management network, VM network and backup network isolated onto their correct subnets and gateways while using the correct physical NICs.
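From what I have read so far, the CLI side of a custom stack looks roughly like the sketch below (pieced together from the docs, with placeholder names and addresses, and untested on our hosts), but I would like confirmation that this is even the right direction before touching anything:

    # create a custom TCP/IP stack, attach a new vmkernel adapter on the Backup
    # port group to it, give it an address, and set a gateway for that stack only
    esxcli network ip netstack add --netstack=backup
    esxcli network ip interface add --interface-name=vmk2 --portgroup-name="Backup LAN" --netstack=backup
    esxcli network ip interface ipv4 set --interface-name=vmk2 --ipv4=192.168.2.21 --netmask=255.255.255.0 --type=static
    esxcli network ip route ipv4 add --network=default --gateway=192.168.2.1 --netstack=backup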

I could of course be barking up the completely wrong tree. 😉

Looking forward to whatever advice or assistance I can get.

All the best 

Nick 

6 Replies
swaheed1239
Enthusiast

Hello Nick,

It's actually pretty straightforward. In your case, you need to start by checking the configuration of the physical switch your ESXi hosts are connected to.

NOTE: The default TCP/IP route table you are seeing on your ESXi host shows the routes for the management vmkernel port used to manage the ESXi host itself.

As per your description, you have 4 ports on each ESXi host and you have 2 hosts, which makes a total of 8 ports going to your physical switch.
As per my understanding, your requirement for network separation is to have 3 different networks: one for management, one for backup and one for VMs.
- Please be informed that the management IP of an ESXi host is by default assigned to a vmkernel port, usually named vmk0. (I hope you are aware of the functionality of, and differences between, a vmnic, a vmknic and a vnic.)
- FYI, vmkernel ports are used by ESXi hosts to handle certain types of traffic internally (within the VMware environment), such as management, iSCSI, vMotion, FT logging and vSAN.

You need to check your physical switch configuration for the 8 ESXi ports connecting to it. Check the below (a rough Cisco example follows this list):
- Check whether the ports are in access mode or trunk mode. If your ESXi hosts need to accept traffic from different subnets/VLANs, the ports need to be in trunk mode.
- Check whether LACP is configured for all 8 ports or just 4 ports per ESXi host.
- All of the network separation happens at the L2 physical switch; be informed that the ESXi standard virtual switch is a Layer 2 device only and does no routing between subnets/VLANs.
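As a rough illustration only (the interface range, VLAN IDs and channel-group number below are made-up values you would replace with your own), the trunk-plus-LACP side on a Cisco switch looks something like this:

    ! hypothetical example for the 4 VM uplinks of one ESXi host
    interface range GigabitEthernet1/0/1 - 4
     description ESXi-host1 VM uplinks
     switchport mode trunk
     switchport trunk allowed vlan 10,20,30
     channel-group 1 mode active
    !
    interface Port-channel1
     switchport mode trunk
     switchport trunk allowed vlan 10,20,30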

If the above criteria are met, then you need to configure the standard virtual switches on both of your ESXi hosts to connect your VMs.
- By default the management network is assigned to the vmkernel port vmk0 and is part of vSwitch0 (unless you've changed it), and it is mapped to one of the physical ports of your ESXi host. Leave it mapped to a single physical port and move forward.
- It would be ideal to map another physical port to vSwitch0 as well and create a new port group named "PG-VM" or "PG-VLANID" on this switch for your VM traffic, alongside the management vmkernel port, since you also need to keep redundancy of the physical ports in mind to avoid a single point of failure.
- Now you have 2 physical ports mapped to your first virtual switch (vSwitch0), through which your VM and management traffic will pass (keep in mind this is management of ESXi only, not of the VMs).

Now you have 2 ports left on each of your ESXi hosts to use for your backup traffic.
- You can create a new virtual switch for backup traffic, map it to the other 2 physical ports, and create a new port group on it named "PG-Backup" or "PG-VLANID".

Please check the VMware docs for the load-balancing modes and the redundancy modes for the physical ports, which need to be configured at the virtual switch level when you configure the uplinks; a rough esxcli sketch of the whole layout follows.
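To make that concrete, here is a rough esxcli sketch of the layout on one host; the vmnic numbers, port group names and VLAN IDs are placeholders to adjust to your environment:

    # vSwitch0 already has vmk0 (management); add a second uplink and a VM port group
    esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch0
    esxcli network vswitch standard portgroup add --portgroup-name="PG-VM" --vswitch-name=vSwitch0
    esxcli network vswitch standard portgroup set --portgroup-name="PG-VM" --vlan-id=10

    # new vSwitch for backup traffic with the remaining two uplinks
    esxcli network vswitch standard add --vswitch-name=vSwitch1
    esxcli network vswitch standard uplink add --uplink-name=vmnic2 --vswitch-name=vSwitch1
    esxcli network vswitch standard uplink add --uplink-name=vmnic3 --vswitch-name=vSwitch1
    esxcli network vswitch standard portgroup add --portgroup-name="PG-Backup" --vswitch-name=vSwitch1
    esxcli network vswitch standard portgroup set --portgroup-name="PG-Backup" --vlan-id=20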

This way you can separate your traffic at the virtual switch level, which gives a logical separation; as I said before, the actual separation of the networks happens on your L2 physical switch.

Now you can assign the VMs' NICs to the appropriate port groups.

Furthermore, as far as your backup traffic goes, you can assign a separate NIC to the proxy VM from the "PG-Backup" port group and give it an IP from the backup subnet.

When a packet from the backup NIC leaves the VM, its header carries the routing information, and it will find its way through the physical ports of the ESXi host to the physical L2 switch, where it is forwarded within its VLAN.

For the DNS issue, you can keep using the same mechanism you are using now.

Please post screenshots of your vSwitch configuration and topology view if you need further assistance on this matter.

Thanks, and don't forget to mark this answer as the solution if it resolves your issue.

IRIX201110141
Champion

KISS! Keep it simple, stupid. Don't mess around with "multihomed" setups, because, as you have noticed, they are hard to control.

 

Solution 1: If you really are facing the problem that Mr. Murphy sends the network traffic of your most important VMs all over the same wire, then just tell ESXi to separate it.

On your already-created Backup port group, modify the failover policy for the vmnics: choose one of your 4 dedicated vmnics as active and the other 3 as standby. For the rest of the port groups, do exactly the opposite configuration. A rough esxcli sketch follows.
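Something like this sketch, done per host; the port group names and vmnic numbers are examples only, replace them with yours:

    # Backup port group: one dedicated active uplink, the others standby
    esxcli network vswitch standard portgroup policy failover set --portgroup-name="Backup" --active-uplinks=vmnic3 --standby-uplinks=vmnic0,vmnic1,vmnic2
    # the other port groups: the opposite
    esxcli network vswitch standard portgroup policy failover set --portgroup-name="VM LAN" --active-uplinks=vmnic0,vmnic1,vmnic2 --standby-uplinks=vmnic3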

Solution 2:
Veeam will easily saturate a 1 Gb pipe as long as your source is fast enough. So if you can't increase your pipe and have ongoing problems, you can throttle the network throughput on your Veeam proxy. Not a common approach for on-premises environments, I would say, but hey, when you have bandwidth problems... throttle it.

 

In general we place a Veeam proxy on every single ESXi host (in large environments we use NBD from the ESXi hosts directly instead of a proxy). If you're low on Microsoft licences, then think about creating a Linux-based Veeam proxy. If your target is also only 1G, please check whether it's possible to configure a LAG/LACP on that target so your multiple Veeam proxies can take effect.

A normal Veeam proxy is configured with up to 4 vCPUs, which means around 10-12 GHz when fully utilised. Is your ESXi host so low on pCPU that one VM affects your complete environment? If so, reduce the vCPU count or play around with CPU limits within ESXi for that given VM.

Never route your backup traffic... or at least avoid it whenever possible. So place your Veeam proxy and the target in the same subnet and make sure that HotAdd is used. If you use NBD for whatever reason, stay in the same network as your ESXi management.
Backup from SAN or SAN snapshots are different animals.

Regards,
Joerg

NickDaGeekUK
Enthusiast

Wow, thanks for that; it gives me a better understanding of how the internal ESXi networking works. I am looking into the physical cabling and port configuration on the switch and will try to relate that to what you have said.

I have 7 physical NICs in use per host:

4 for VM LAN traffic on each host

2 for Backup traffic on each host

1 for Management on each host

I will post my Cisco and ESXi network details, as this is making my head spin a bit. 😵
I have just finished applying the VCSA patch for https://blogs.vmware.com/vsphere/2021/05/vmsa-2021-0010.html

and am undoing the damage caused by following "KeyStores with multiple certificates are not supported on the base class", Update Manager Service (...

which breaks the 6.5U3q Update Manager it's supposed to fix. I suspect the fix only works on 6.7 and above.

Kind regards,
Nick.
0 Kudos
NickDaGeekUK
Enthusiast

Totally agree that KISS is the only principle to use in most cases.
Sadly I inherited this one, and I am trying to figure out what is going on and whether it needs fixing.

I am very much a noob to VMware, coming from a Hyper-V background, so please bear with me; it's a bit of a learning curve.

I am in contact with Veeam Support and they are also suggesting the Linux proxy and removing the multihoming.

Have a look at the network I have in place (I posted some screenshots in my reply to swaheed1239)
and see if you can spot what is going on and why it doesn't work as I expect. I have had a suspicion for months now that the poor performance of some data transfers off a VM on the first host (which is not multihomed) could be down to a configuration error at the ESXi and/or Cisco switch level. I am seeing a maximum of 100 MB/s over a gigabit network on a lot of occasions. However, you may well be right that it could simply be saturating a 1GbE link. Any and all comments gratefully received.

Kind regards,
Nick.
IRIX201110141
Champion

Well... I am not sure if this can be solved through the VMware forum. To me it sounds like a Cisco dude (no offense....... hmm, maybe a little) has set up your vSphere environment.

  1. Your ESXi 6.0 is out of support.
  2. An HA requirement and good practice is to have redundant networking for ESXi management. I only see one uplink on vSwitch0 in your setup.
  3. Since a Veeam proxy only has one IP/MAC, it will never leverage the benefits of a LAG/LACP. The 2 uplinks for your backup are just for redundancy, which is OK, but not for increasing bandwidth, so the LAG setup just makes this completely over-complicated. Network redundancy has been built into ESXi since 2003 and does not need any special setup if you want KISS (yes, there are reasons for doing things the other way around).
  4. For a long time ESXi only supported one default gateway, and in small setups like yours there is no need for more (more is for the guys with stretched vSAN clusters, hybrid cloud and long-distance vMotion). So the standard setup is a vmk0 which holds the management vmkernel with its IP, FQHN, DNS and gateway, on vSwitch0 with 2 uplinks for redundancy. Every additional vmk serves a special ESXi-related purpose like vMotion, FT, or iSCSI/vSAN/NFS, but none of them is related to VM traffic.
  5. Why someone would place tagged and untagged port groups on the same vSwitch/uplinks is beyond me. Tag all PGs or none of them.
  6. Dedicated Veeam proxies should be excluded from any backup job because there is no need to back up these "helper" VMs. If you have a virtual Veeam server, then by default the proxy role is also deployed on it. You can disable this role so there is no danger that you back up your backup of your backup (when doing HotAdd). In general, Veeam detects whether the proxy role is installed inside a VM and will always place that VM at the end of the list of VMs for backup to avoid this problem.
  7. You should remove the "Backup" vmk (a minimal esxcli sketch follows this list).
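For point 7, a minimal sketch, assuming the backup vmkernel adapter is vmk1 on your hosts (check first, and never remove vmk0):

    # confirm which vmk carries which IP before removing anything
    esxcli network ip interface ipv4 get
    # remove the backup vmkernel adapter (vmk1 here is an assumption)
    esxcli network ip interface remove --interface-name=vmk1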

 

  • I always thought that LAG/LACP is only supported with a vDS and not a vSS, but maybe things have changed.
  • Your 100 MB/s is around 1 Gbit, so you are already utilising your physical wire. Keep in mind that if a proxy sends 100 MB/s over the wire to the backup repository, it most likely reads around 200 MB/s from the storage while doing compression and dropping pagefile/zeroed blocks when performing a full backup with HotAdd. If you do this with 3 proxies in parallel, that's up to 600 MB/s read -> 300 MB/s to the target, which is often the bottleneck in SMB environments.
  • If you have a spare Windows OS licence, then use a Windows proxy instead of Linux (which is more time-consuming to set up), but that depends on your skills.
  • I have not verified which vmk is the leading one in your setup.

Regards,
Joerg

NickDaGeekUK
Enthusiast

Hi Joerg

"Well.... I am not sure if this can be solved through the vmware forum. For me it sounds that a cisco dude (no offense....... hmm maybe a little) have setup your vSphere environment." ROFL 🤣.

That makes sense, to be honest with you. I have a limited understanding of ESXi networking and was puzzled by the port group and vmk concepts. If I understand what you have told me, the vmks are for host services like iSCSI, vMotion, etc.

We are using iSCSI, but I can't see any iSCSI traffic assigned to the backup vmk, so I am assuming it is going straight out from the VM over the 4-port LAG and not touching the Backup LAN at all. The fact that the vSwitch doesn't do any routing between the subnets was something I really had no clue about but was starting to suspect.

"Your ESXi 6.0 is out of support" 😓 I know,

Ancient hardware so can't push it too far forward, best I can achieve is likely to be 6.5 based on hardware compatibility of CPU. I am looking at patching next.

You have given me a lot to think about. Will come back to this tomorrow. 

Thanks

Nick
