vSphere 7u1 environment in my homelab, licensed through VMUG.
vCenter is running on a host I want to decommission, but it is still connected to the standard vSwitch. I have the same network on my distributed vSwitch, and other VMs on that distributed portgroup work without issue.
When I change the vCenter network binding to the distributed portgroup, vCenter shows all the hosts in a Not Responding state.
In this state, from vCenter I can ping the hosts and curl them on port 902, and the hosts can ping vCenter and other VMs on that distributed portgroup. I can also continue to access the vSphere UI without issue. When I log in to the host directly and change vCenter back to the vSwitch portgroup, the hosts eventually return to a healthy state.
I am seeing these errors (gathered by Runecast):
message-syslog warning rhttpproxy [Originator@6876 sub=Proxy Req 27891] Error reading from client while waiting for header: N7Vmacore15SystemExceptionE(Connection reset by peer: The connection is terminated by the remote end with a reset packet. Usually, this is a sign of a network problem, timeout, or service overload.)
predicate Error reading from client while waiting for header
message-syslog warning rhttpproxy [Originator@6876 sub=Proxy Req 18143] Error reading from client while waiting for header: N7Vmacore15SystemExceptionE(Connection timed out)
predicate Error reading from client while waiting for header
There is also this in vCenter vpxd.log:
2021-02-09T13:02:54.803-09:00 warning vpxd [Originator@6876 sub=Vmomi opID=FdmMonitor-domain-c7-4215dae5] [FdmClientAdapter] Got vmacore exception when invoking csi.FdmService.GetDebugManager on smesx02.incendiary.local: Server closed connection after 0 response bytes read; <SSL(<io_obj p:0x00007fd2e43cc308, h:64, <TCP '10.0.10.100 : 45520'>, <TCP '10.0.10.12 : 443'>>)>
2021-02-09T13:08:41.339-09:00 warning vpxd [Originator@6876 sub=HTTP server] UnimplementedRequestHandler: HTTP method POST not supported for URI /. Request from 10.0.10.100.
2021-02-09T13:08:41.339-09:00 warning vpxd [Originator@6876 sub=HostGateway] State(ST_CM_LOGIN) failed with: HTTP error response: Bad Request
2021-02-09T13:08:41.339-09:00 warning vpxd [Originator@6876 sub=HostGateway] Ignoring exception during refresh of HostGateway cache: N7Vmacore4Http13HttpExceptionE(HTTP error response: Bad Request)
I'm not sure what else or where else to look.
I can find no actual connectivity issue: vCenter can communicate with the hosts on port 902 (validated by curl), and ICMP works. Both the vSwitch and distributed vSwitch portgroups are on the same VLAN, and other VMs on the distributed portgroup work without issue, so basic connectivity between vCenter and the hosts is clearly intact. There's just some other communication failing.
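To rule out the basics in a repeatable way, the probes I ran look roughly like this. This is only a sketch: the IP is an illustrative stand-in for each ESXi management address, and bash's /dev/tcp redirection stands in for curl as a bare TCP connect test:

```shell
# Minimal TCP reachability check from the vCenter appliance shell.
# bash's /dev/tcp redirection opens a plain TCP connection; timeout
# bounds the wait for filtered ports.
check_tcp() {
  local host=$1 port=$2
  if timeout 2 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed/filtered"
  fi
}
# Substitute your actual ESXi management IPs here (10.0.10.12 is illustrative).
for p in 443 902; do
  check_tcp 10.0.10.12 "$p"
done
```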
Try the following:
Of course, if the host you want to decommission is not in the vDS, you will need to add it; after the vNIC migration and a further vMotion you can safely remove it from there.
Yes, it seems I never migrated vCenter to the vDS. It is currently the only VM on a vSS. vCenter behaved the same way on 3 different hosts, though I have since decommissioned 2 of those. Only this 1 host is left with a vSwitch in this configuration, and this host is also on the vDS. My other hosts have their vSS on a different VLAN (my new management VLAN) for a secondary management vmkernel port only.
But the same issue was occurring when I was trying to migrate vCenter to my new management VLAN on the same vDS (a different VLAN than the vSS it's currently on). I originally thought the issue was related to re-IP'ing vCenter, and was not expecting it to be caused by simply migrating to the vDS.
I have not (yet) tried migrating the uplink to my vDS.
It's interesting that ICMP connectivity is still there, and I can successfully curl from vCenter to the hosts.
I did already create an additional ephemeral portgroup and tried migrating vCenter there, with the same results, though at the time I didn't try migrating via the Migrate Virtual Machines wizard.
This host is a member of the vDS.
The Migrate VMs wizard does report that the destination network is accessible.
But just like when editing the network backing in vCenter, the status gets stuck at 99%. After a few minutes it times out with "An error occurred while communicating with the remote host," and all hosts go to a Not Responding state.
My continuous ping drops only a single packet.
vCenter can still successfully curl the hosts on port 902, and pings succeed from vCenter to the hosts and from the hosts to vCenter while in this Not Responding state.
May I know whether, after this move of vCenter, the ESXi hosts and vCenter end up on the same destination network? Maybe port 902 TCP works but 902 UDP is getting blocked; that would only apply, of course, if they sit in different networks.
Looking at your error at the top, it also reports an error on port 443, when connecting from vCenter to the ESXi hosts, I presume. Maybe the issue you are facing is not related to heartbeats but to the certificates failing validation against the VMCA. Could you please confirm that you can also reach port 443 over TCP?
@Lalegre The vSS portgroup and vDS portgroup are on the same VLAN and network. vCenter and the hosts are on different networks, though I was having this exact same issue when attempting to migrate vCenter (to the appropriate portgroup on the vDS, with a re-IP) to the same VLAN as the hosts. I initially thought the issue was due to the re-IP of vCenter, and hadn't considered that it was caused by migrating to the vDS.
Firewall rules are allow any-any between these 2 networks.
I can curl from vCenter to the hosts on 443 as well.
The hosts still have their self-signed certificates, and curl returns the same output regardless of which portgroup vCenter is on; but it is able to connect:
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
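The warning above only means curl can't validate the chain against its local CA bundle; skipping verification confirms that TCP and the TLS handshake themselves work. A minimal example (same hostname as in the vpxd.log snippet):

```shell
# -k / --insecure skips certificate chain validation, so this succeeds against
# ESXi's self-signed cert as long as TCP 443 and the TLS handshake work.
# On a connect failure curl still prints the -w format with code 000.
curl -k -s -o /dev/null --max-time 5 \
  -w 'HTTP %{http_code}\n' https://smesx02.incendiary.local/ \
  || echo 'connect failed'
```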
So I am getting a little confused here.
Did you re-IP the vCenter Server before making this change? Which procedure did you follow?
Have you also checked port 902, but over UDP?
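If you don't have netcat handy, a crude UDP probe is possible with bash's /dev/udp redirection. This is only a sketch, and since UDP is connectionless, "no error" just means open or filtered; the host and port below are illustrative:

```shell
# Crude UDP probe via bash's /dev/udp. A datagram to a closed port triggers
# an ICMP port-unreachable, and the error surfaces on the *second* write to
# the connected socket, so two sends distinguish closed from open-or-filtered.
probe_udp() {
  local host=$1 port=$2
  if bash -c "exec 3>/dev/udp/$host/$port; echo x >&3; sleep 0.3; echo x >&3" 2>/dev/null; then
    echo "$host:$port open or filtered"
  else
    echo "$host:$port unreachable"
  fi
}
# e.g. run from an ESXi shell toward the vCenter IP (illustrative address):
probe_udp 10.0.10.100 902
```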
Originally, I was trying to re-IP vCenter, which included migrating from the current vSS portgroup on VLAN1 to a vDS portgroup on VLAN10. The hosts are on VLAN10.
This is the process I followed to migrate vCenter to the vDS portgroup on the same VLAN as the hosts:
1: Change DNS record TTL to 5 minutes well in advance.
2: Create new portgroup on dvswitch that's configured as ephemeral.
3: Snapshot vCenter.
4: Access VAMI and update IP.
5: Delete old DNS record and create new with new IP.
6: Connect to host and update vCenter network adapter to new ephemeral portgroup.
7: Access vCenter console and "fix" management network by adding default gateway (this doesn't stick?)
8: Reboot vCenter.
All hosts went to Not Responding.
I would cat /etc/vmware/vpxa/vpxa.cfg and see that <serverIp> had updated to the new vCenter IP. Despite that, the hosts were still not responding even after several hours. I restored vCenter and everything returned to normal.
I tried this multiple times with the same results, then gave up.
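For reference, a sketch of that serverIp check. The sample file below is illustrative; on a real host you would point grep at /etc/vmware/vpxa/vpxa.cfg itself:

```shell
# Illustrative fragment of vpxa.cfg; on an ESXi host the real file lives at
# /etc/vmware/vpxa/vpxa.cfg.
cat > /tmp/vpxa-sample.cfg <<'EOF'
<config>
  <vpxa>
    <serverIp>10.0.10.100</serverIp>
    <serverPort>902</serverPort>
  </vpxa>
</config>
EOF
# Extract the vCenter IP the host thinks it should talk to.
grep -o '<serverIp>[^<]*' /tmp/vpxa-sample.cfg | sed 's/<serverIp>//'
```

If serverIp is already correct but the host stays Not Responding, restarting the host's management agents (e.g. `/etc/init.d/vpxa restart` from the ESXi shell) is sometimes enough to force a reconnect.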
I wanted to decommission this particular host, so I started on this project of simply migrating vCenter from the vSS portgroup to the vDS portgroup, without changing VLAN/network/IP, but ran into the same exact issue with the same results.
Neither UDP nor TCP on port 902 is being dropped.
Side note: I finally migrated vCenter to VLAN10, the same VLAN as the hosts, this weekend by creating a new portgroup on the vSS of all my hosts. I have not yet tried migrating it to the vDS again.
Fair enough. It seems you tried both scenarios. With your explanation, everything is clearer to me now. Now that you have the vCenter Server on the same network, it should work.
I did this in the past and did not face any issues, but maybe it is something version-related. Is your vDS updated to the latest version?
Also, what happened to the vCenter gateway is pretty weird. If that happens again you will certainly face issues, but it depends on how your ESXi hosts are connected to vCenter. If they are connected by IP, there should not be any issue, as they are on the same L2 segment; but if they are added by FQDN and your DNS server sits on another network, then you will have an L3 reachability problem.
One temporary workaround is to add entries for your ESXi hosts to the /etc/hosts file until you migrate; after that you can configure the gateway again (of course, only if this happens again).
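A sketch of that workaround, done against a copy so it is safe to run anywhere; on the vCenter appliance you would append to /etc/hosts itself (the hostname/IP pairing below is illustrative):

```shell
# Work on a copy so the sketch can be run without root or side effects;
# on the appliance the real file is /etc/hosts.
cp /etc/hosts /tmp/hosts.test
cat >> /tmp/hosts.test <<'EOF'
10.0.10.12  smesx02.incendiary.local smesx02
EOF
# Confirm the entry landed.
grep smesx02 /tmp/hosts.test
```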