VMware Cloud Community
TryllZ
Expert

Migrating from Virtual Switch to Distributed Switch fails even with Redundant Uplinks

Hi,

To begin with, I've been observing lately that the last 2 versions of Workstation have a lot of connectivity issues. This specifically occurs after snapshots are taken and restored, and/or when VMs are paused and resumed (after some hours): the VMs lose connectivity and I have to manually ping each and every interface to restore it. I've also observed this only happens where routing is involved, as I did not face any issues when all VMs are on the same network.

I'm migrating from a Virtual Switch to a Distributed Switch in vCenter; the ESXi hosts have dual vmnics and dual vmk adapters.
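For reference, these are roughly the ESXi shell commands I use on each host to confirm which vmnics and vmk adapters are present and where they are attached before migrating (standard esxcli commands; adjust to your own host):

[root@compute1:~] esxcli network nic list                  # physical uplinks (vmnic0, vmnic1, ...)
[root@compute1:~] esxcli network ip interface list         # vmk adapters and the portgroup/netstack they sit on
[root@compute1:~] esxcli network ip interface ipv4 get     # IPv4 configuration of each vmk
[root@compute1:~] esxcli network vswitch standard list     # uplinks still bound to the standard vSwitch
[root@compute1:~] esxcli network vswitch dvs vmware list   # what (if anything) is already on the dvSwitch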

I have 2 clusters with the following ESXi hosts.

Windows Server (DNS) - 192.168.10.2

vCenter - 192.168.10.5

Compute Cluster

* compute1.v.lab - vmk0 (192.168.30.10), vmk1 (192.168.30.11)

* compute2.v.lab - vmk0 (192.168.30.20), vmk1 (192.168.30.21)

Infrastructure Cluster

* infrastructure1.v.lab - vmk0 (192.168.20.10), vmk1 (192.168.20.11)

* infrastructure2.v.lab - vmk0 (192.168.20.20), vmk1 (192.168.20.21)

* infrastructure3.v.lab - vmk0 (192.168.20.30), vmk1 (192.168.20.31)

I'm following this https://www.youtube.com/watch?v=eDJ3OfXTkLs for migrating via the GUI; however, for some reason some ESXi hosts migrated successfully while others failed. Both sets of ESXi hosts have a similar configuration. In my case the compute cluster hosts migrated successfully, while the infrastructure cluster hosts failed with the errors below.

Both vmk0 and vmk1 can ping vCenter, and vCenter can ping both interfaces of the ESXi host as well.

[root@infrastructure1:~] vmkping -I vmk0 192.168.10.5

PING 192.168.10.5 (192.168.10.5): 56 data bytes

64 bytes from 192.168.10.5: icmp_seq=0 ttl=63 time=0.992 ms

64 bytes from 192.168.10.5: icmp_seq=1 ttl=63 time=0.724 ms

64 bytes from 192.168.10.5: icmp_seq=2 ttl=63 time=0.720 ms

--- 192.168.10.5 ping statistics ---

3 packets transmitted, 3 packets received, 0% packet loss

round-trip min/avg/max = 0.720/0.812/0.992 ms

[root@infrastructure1:~] vmkping -I vmk1 192.168.10.5

PING 192.168.10.5 (192.168.10.5): 56 data bytes

64 bytes from 192.168.10.5: icmp_seq=0 ttl=63 time=0.731 ms

64 bytes from 192.168.10.5: icmp_seq=1 ttl=63 time=0.895 ms

64 bytes from 192.168.10.5: icmp_seq=2 ttl=63 time=1.497 ms

--- 192.168.10.5 ping statistics ---

3 packets transmitted, 3 packets received, 0% packet loss

round-trip min/avg/max = 0.731/1.041/1.497 ms
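For completeness, the same paths can also be checked with the DF bit set to rule out an MTU mismatch across the router (1472 bytes of payload + 28 bytes of headers = a standard 1500 MTU); the -S option is only needed if a vmk lives on a separate TCP/IP stack:

[root@infrastructure1:~] vmkping -I vmk0 -d -s 1472 192.168.10.5
[root@infrastructure1:~] vmkping -I vmk1 -d -s 1472 192.168.10.5
[root@infrastructure1:~] vmkping -I vmk0 -S vmotion 192.168.10.5   # only if vmk0 is on the vmotion netstack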

PS C:\Users\Administrator> ssh root@192.168.10.5

Command> ping 192.168.20.10

PING 192.168.20.10 (192.168.20.10) 56(84) bytes of data.

64 bytes from 192.168.20.10: icmp_seq=1 ttl=63 time=1.06 ms

64 bytes from 192.168.20.10: icmp_seq=2 ttl=63 time=1.19 ms

64 bytes from 192.168.20.10: icmp_seq=3 ttl=63 time=0.686 ms

64 bytes from 192.168.20.10: icmp_seq=4 ttl=63 time=0.833 ms

^C

--- 192.168.20.10 ping statistics ---

4 packets transmitted, 4 received, 0% packet loss, time 9ms

rtt min/avg/max/mdev = 0.686/0.942/1.194/0.196 ms

Command> ping 192.168.20.11

PING 192.168.20.11 (192.168.20.11) 56(84) bytes of data.

64 bytes from 192.168.20.11: icmp_seq=1 ttl=63 time=0.798 ms

64 bytes from 192.168.20.11: icmp_seq=2 ttl=63 time=1.11 ms

64 bytes from 192.168.20.11: icmp_seq=3 ttl=63 time=1.13 ms

64 bytes from 192.168.20.11: icmp_seq=4 ttl=63 time=0.689 ms

^C

--- 192.168.20.11 ping statistics ---

4 packets transmitted, 4 received, 0% packet loss, time 33ms

rtt min/avg/max/mdev = 0.689/0.931/1.129/0.195 ms

vCenter Log

2020-05-15T08:05:11.155Z info vpxd[15507] [Originator@6876 sub=HostCnx opID=CheckforMissingHeartbeats-74856499] [VpxdHostCnx] No heartbeats received from host; cnx: 5219c9b7-862a-17bf-2de1-2b471e0435a1, h: host-1029, time since last heartbeat: 2634158ms

2020-05-15T08:05:11.156Z info vpxd[15507] [Originator@6876 sub=HostCnx opID=CheckforMissingHeartbeats-74856499] [VpxdHostCnx] No heartbeats received from host; cnx: 52fdf7fe-5824-585f-d256-f62222ad4478, h: host-1026, time since last heartbeat: 2633936ms

2020-05-15T08:05:33.379Z info vpxd[32047] [Originator@6876 sub=HostGateway] CmConnectionFSM::RunFSM(ST_CM_CALL_FAILED)

2020-05-15T08:05:33.413Z warning vpxd[15514] [Originator@6876 sub=HTTP server] UnimplementedRequestHandler: HTTP method POST not supported for URI /. Request from 192.168.10.5.

2020-05-15T08:05:33.431Z error vpxd[32047] [Originator@6876 sub=HostGateway] [CisConnection]: ComponentManager->LoginByToken failed: HTTP error response: Bad Request

2020-05-15T08:05:33.431Z warning vpxd[32047] [Originator@6876 sub=HostGateway] State(ST_CM_LOGIN) failed with: HTTP error response: Bad Request

2020-05-15T08:05:33.569Z warning vpxd[14404] [Originator@6876 sub=HTTP server] UnimplementedRequestHandler: HTTP method POST not supported for URI /. Request from 192.168.10.5.

2020-05-15T08:05:33.571Z error vpxd[32047] [Originator@6876 sub=HostGateway] [CisConnection]: ComponentManager->LoginByToken failed: HTTP error response: Bad Request

2020-05-15T08:05:33.571Z warning vpxd[32047] [Originator@6876 sub=HostGateway] State(ST_CM_LOGIN) failed with: HTTP error response: Bad Request

2020-05-15T08:05:33.678Z warning vpxd[32043] [Originator@6876 sub=HTTP server] UnimplementedRequestHandler: HTTP method POST not supported for URI /. Request from 192.168.10.5.

2020-05-15T08:05:33.680Z error vpxd[32047] [Originator@6876 sub=HostGateway] [CisConnection]: ComponentManager->LoginByToken failed: HTTP error response: Bad Request

2020-05-15T08:05:33.680Z warning vpxd[32047] [Originator@6876 sub=HostGateway] State(ST_CM_LOGIN) failed with: HTTP error response: Bad Request

2020-05-15T08:05:33.795Z warning vpxd[32044] [Originator@6876 sub=HTTP server] UnimplementedRequestHandler: HTTP method POST not supported for URI /. Request from 192.168.10.5.

2020-05-15T08:05:33.797Z error vpxd[32047] [Originator@6876 sub=HostGateway] [CisConnection]: ComponentManager->LoginByToken failed: HTTP error response: Bad Request

2020-05-15T08:05:33.797Z warning vpxd[32047] [Originator@6876 sub=HostGateway] State(ST_CM_LOGIN) failed with: HTTP error response: Bad Request

2020-05-15T08:05:33.800Z warning vpxd[32047] [Originator@6876 sub=HostGateway] Ignoring exception during refresh of HostGateway cache: N7Vmacore4Http13HttpExceptionE(HTTP error response: Bad Request)

-->

2020-05-15T08:05:49.459Z info vpxd[14459] [Originator@6876 sub=Health] Wrote vpxd health XML to file /etc/vmware-sca/health/vmware-vpxd-health-status.xml. Status: YELLOW. Expiration: 5267

2020-05-15T08:05:51.918Z info vpxd[14446] [Originator@6876 sub=vpxLro opID=q-1952:h5ui-getProperties:urn:vmomi:HostSystem:host-1032:41c6ed99-ab4e-45ef-af71-e6993ff2ddda:1349277837:01-61] [VpxLRO] --

2020-05-15T08:06:23.037Z info vpxd[14767] [Originator@6876 sub=vpxLro opID=sps-Main-496359-494-437f-31] [VpxLRO] -- FINISH lro-9879

2020-05-15T08:06:26.069Z info vpxd[14436] [Originator@6876 sub=VapiEndpoint.HTTPService.HttpConnection] HTTP Connection read failed while waiting for further requests [N7Vmacore4Http14HttpConnectionE:0x00007f5004851010]: N7Vmacore16TimeoutExceptionE(Operation timed out: Stream: <io_obj p:0x000055917ab251c8, h:-1, <TCP '127.0.0.1 : 8093'>, <TCP '127.0.0.1 : 38974'> FD Closed>, duration: 00:00:45.995834 (hh:mm:ss.us))

--> [context]zKq7AVECAAAAAK9N4wAMdnB4ZAAAHHMubGlidm1hY29yZS5zbwAAh9QZAPvxGACHWxYAH1QYAM0DJQDqFiMAbL0iAOANIwBeWioBB3wAbGlicHRocmVhZC5zby4wAAIfKQ9saWJjLnNvLjYA[/context]

2020-05-15T08:06:26.069Z info vpxd[14527] [Originator@6876 sub=VapiEndpoint.HTTPService.HttpConnection] HTTP Connection read failed while waiting for further requests [N7Vmacore4Http14HttpConnectionE:0x00007f4ffc0a2670]: N7Vmacore16TimeoutExceptionE(Operation timed out: Stream: <io_obj p:0x00007f50043c2368, h:-1, <TCP '127.0.0.1 : 8093'>, <TCP '127.0.0.1 : 38998'> FD Closed>, duration: 00:00:45.971136 (hh:mm:ss.us))

--> [context]zKq7AVECAAAAAK9N4wAMdnB4ZAAAHHMubGlidm1hY29yZS5zbwAAh9QZAPvxGACHWxYAH1QYAM0DJQDqFiMAbL0iAOANIwBeWioBB3wAbGlicHRocmVhZC5zby4wAAIfKQ9saWJjLnNvLjYA[/context]

2020-05-15T08:06:27.136Z info vpxd[14384] [Originator@6876 sub=vpxLro opID=opId-c716a-2210-54] [VpxLRO] -- BEGIN lro-9881 -- SessionManager -- vim.SessionManager.sessionIsActive -- 52262bbd-e867-b55c-a353-211fcbd234e7(52469f4d-189b-1837-f76d-c689feac716a)

2020-05-15T08:06:27.137Z info vpxd[14384] [Originator@6876 sub=vpxLro opID=opId-c716a-2210-54] [VpxLRO] -- FINISH lro-9881
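Since the log complains about missing heartbeats (the hosts send these to vCenter over UDP 902), something along these lines on an affected host should show whether the heartbeats actually leave and over which vmk (tcpdump-uw ships with ESXi; the vmk names here are just examples):

[root@infrastructure1:~] esxcli network ip connection list | grep 902   # sockets towards vCenter on port 902
[root@infrastructure1:~] tcpdump-uw -i vmk0 -n udp port 902             # watch heartbeat packets on vmk0
[root@infrastructure1:~] tcpdump-uw -i vmk1 -n udp port 902             # ...and on vmk1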

(Screenshots attached: server-2020-05-15-08-28-22.png, server-2020-05-15-08-41-51.png, server-2020-05-15-08-41-55.png, server-2020-05-15-08-41-58.png, server-2020-05-15-08-43-35.png)

Thanks..

4 Replies
scott28tt
VMware Employee

Moderator: Thread moved to the vSphere vNetwork area.


-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee I contribute to VMware Communities voluntarily (ie. not in any official capacity)
VMware Training & Certification blog
TryllZ
Expert

An update on this: I tried migrating on a fresh setup with dual vmnics and a single vmk, and surprisingly it migrated successfully.

Sadly it only worked that once. It's as if vCenter does it randomly: sometimes it works, sometimes it doesn't, and sometimes it works on one host but not on another. Or am I not following the right procedure?

Can I know if there is a standard way to migrate from a vSwitch to a DVSwitch?

Thanks..

scott28tt
VMware Employee

This may help: How to Do the Old Switcheroo: Migrating vSS to vDS with Zero Downtime - VMware on VMware


-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee I contribute to VMware Communities voluntarily (ie. not in any official capacity)
VMware Training & Certification blog
TryllZ
Expert

Thanks scott28tt, the link was helpful in understanding the migration.

I finally found the problem (no solution yet). The issue is that one of the vmnics in each ESXi host does not migrate, and why that is I fail to understand. I discovered this by manually migrating each vmnic individually until the last one, vmnic3, which has the same status as vmnic2 (not configured or attached to any switch). Surprisingly vmnic2 migrated successfully while vmnic3 fails; I'm not sure why.
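In case it helps, my understanding is that a stuck vmnic can also be moved from the host shell rather than through vCenter, roughly like this (the DVPort ID must be a free uplink port taken from esxcfg-vswitch -l, and vSwitch0 / DSwitch are only placeholder names for my switches):

[root@infrastructure1:~] esxcfg-vswitch -l                                      # note the dvSwitch name and a free uplink DVPort ID
[root@infrastructure1:~] esxcli network vswitch standard uplink remove -u vmnic3 -v vSwitch0
[root@infrastructure1:~] esxcfg-vswitch -P vmnic3 -V <free-dvport-id> DSwitch   # attach vmnic3 to the dvSwitch on that port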

This only happens in deployments where the ESXi hosts are in a different subnet than vCenter and a router sits in between; the vCenter logs then show a lot of missed heartbeats. When the ESXi hosts and vCenter are in the same subnet everything works fine.

Any thoughts, thanks.
