VMware Cloud Community
kaychan
Contributor

ESX hosts keep disconnecting from Virtual Center Server

We have two ESX hosts in our DMZ, with a firewall between them and the Virtual Center Server. The ESX hosts have been managed by the Virtual Center Server since September 2007.

Today we applied the latest November 2007 patches to one of the hosts, and after the reboot we found that it will not stay connected to our Virtual Center server. We can connect it manually, and it stays connected for about 30 seconds to 1 minute and then disconnects again.

Without making any changes to the second host, we rebooted it and it exhibits the same symptoms (it keeps disconnecting from Virtual Center).

The two ESX hosts are both running ESX 3.0.1, one at patch level 42829 and one at patch level 60875. The Virtual Center server is running VC 2.0.1.

We checked with the network team, and they said all the ports needed for VI3 are open:

VC --> ESX host:902/tcp

VC --> ESX host:903/tcp

ESX host:902/udp --> VC (UDP heartbeat)

ESX host --> VC:27000/tcp

ESX host --> VC:27010/tcp

We have other ESX hosts on our corporate network without the firewall. We just patched one of those hosts to 60875 and it has no connection problem. Everything is running fine.

Please help!!

42 Replies
kaychan
Contributor

I tried that and again no luck.

lholling
Expert

Hi Kaychan

In your first message you mention that you are using VC v2.0.1. You cannot use that version to manage v3.0.2 ESX servers; you MUST upgrade VC to v2.0.2, it is a prerequisite.

After you have upgraded, use the following link to fix the web access component of VC (it is still broken in 2.0.2 Update 2):

http://communities.vmware.com/thread/110156

Once you have done this, a new agent will be copied down to the ESX servers, and this will probably fix your problem.

Leonard...

Don't forget if the answers help, award points

kaychan
Contributor

In my first message, I mentioned that my ESX servers are running ESX 3.0.1.

Chamon
Commander

Kaychan,

I just tested this with our test ESX server and was able to get the same message (not responding) that you are getting by disabling vpxHeartbeats on the local ESX server firewall. Can you make sure that your local ESX firewall did not close port 902 by running

esxcfg-firewall -q vpxHeartbeats

If you get the response

service vpxHeartbeats is Blocked

you need to re-enable it

esxcfg-firewall -e vpxHeartbeats

To check the firewall between the VC and the ESX host:

I don't know if you can do this, but if you enable the telnetClient on the local firewall of the ESX host that won't communicate with the VC, you can test your firewall and the access to your ESX host from your VC with telnet esx_host_ip 902. This depends on your firewall rules as well.

I tried it with ours, and it will not let you authenticate, but it will test your connection through the firewall.

esxcfg-firewall -e telnetClient (service names are case-sensitive, as usual)

From your VC server open a command prompt

telnet ip_of_esx_host 902

If you get a message from the ESX server stating:

220 VMware Authentication Daemon version 1.10: SSL Required

then you would know that your firewall port 902 VC==> ESX Host is definitely open.
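
If telnet is not available on the VC server, the same reachability check can be approximated with a few lines of Python. This is only a minimal sketch, assuming Python 3 is installed on the machine you test from; `probe_tcp_banner` is a made-up helper name, and `ip_of_esx_host` is the same placeholder as in the telnet command. It only exercises TCP, so it says nothing about the UDP heartbeat.

```python
import socket


def probe_tcp_banner(host, port, timeout=5.0):
    """Connect to host:port over TCP and return the first data the server sends.

    Returns None if the connection fails or times out (port closed or
    blocked), and "" if the connection succeeds but no banner arrives
    before the timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                data = sock.recv(256)
            except socket.timeout:
                # Connected, but the server stayed silent.
                return ""
            return data.decode("ascii", errors="replace").strip()
    except OSError:
        # Refused, unreachable, name lookup failed, or connect timed out.
        return None


# Hypothetical usage from the VC server:
#   print(probe_tcp_banner("ip_of_esx_host", 902))
```

A result beginning with 220 would correspond to the telnet output above. A result of None suggests the port is closed or blocked somewhere along the path.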

From the ESX host, try

ssh -p 902 ip_vcenter_server

This opens a security hole, but it will confirm or rule out your earlier concern that the port is not open on your firewall.

Chamon Added a few lines.

kaychan
Contributor

Chamon,

Thank you, I appreciate your help. I followed your suggestions and confirmed that port 902 is open at the firewall.

I have asked our network team to check the network configuration of the vmnics. It seems that vmnic0 (used by the Service Console) is configured under the same network group as vmnic1 and vmnic4, which are under vSwitch1 (used by the VMs). I am wondering whether this would cause a routing problem.

I can ping the ESX host from the VC server, which could explain why I can connect when I add the host. But the host is disconnected and becomes "not responding" after about 30 seconds, and I am wondering whether that could be caused by a routing problem.

I have tried many of the things suggested by other forum members, but no luck so far. I hope the network team will review and correct the configuration of vmnic0 and that this will resolve the problem.

jonathanp
Expert

Hold on. I had this problem about a week ago, when I was writing a kickstart script with a post-section script that linked vmnic6 (in our case) to the Service Console switch.

After we ran this script, the server became unresponsive.

At first I thought my script was bad, but I finally found that it was caused by the physical switch the NICs are connected to. This was originally an ESX 2.5 host that we migrated to ESX 3.0, so the Service Console on ESX 3.0 is not using the same vmnic, and that port was configured for a different network under ESX 2.5. We will change this, but for now I just unlinked vmnic6 from the Service Console and everything is working well; the fix was instant.

So our vmnic6 and vmnic0 had a conflict in the way the physical switch ports are configured or connected (I'm not on site), so they will change the cabling to match our ESX 3.0 setup.

Try to use only one vmnic for the virtual switch that the Service Console is on:

esxcfg-vswitch -l (to see uplink to the switch)

esxcfg-vswitch -U vmnic# vSwitch0 (to unlink vmnic#)

esxcfg-vswitch -L vmnic# vSwitch0 (to link vmnic# )

So try to have one vSwitch for the VMs and one for the Service Console, for example:

vSwitch0 - linked to vmnic0 and vmnic1 "Service Console"

vSwitch1 - linked to vmnic2 and vmnic3 "Production"

Hope this will help

Jon

kaychan
Contributor

Thanks Jon.

This sounds very encouraging. I think our problem is that vmnic0 (assigned to the SC) is misconfigured and is placed under the same network as vSwitch1. I have sent a request to the network group to reconfigure it so that the networks of vSwitch0 and vSwitch1 are not mixed. I hope to hear from them soon.

FCE
Contributor

We had the same problem with our ESX 2.5.4 host in the DMZ after upgrading to VC 2.0.2. In our case, the only solution that ever fixed the problem was to move the Service Console NIC (and only the Service Console NIC) out of the DMZ and back into the non-firewalled network. The VMs on that box still only had access to the DMZ network. Everything worked fine with earlier versions of VC, and the firewall logs showed no blocked packets, but the behavior was exactly as you described. When I set up our new 3.0.2 servers in the DMZ, I built the Service Console NIC on the internal network from the beginning so I wouldn't face this issue.

Chamon
Commander

Kaychan,

How did this work out? Was it an issue with your NIC?

kaychan
Contributor

It seems to us that vmnic0 (used by the ESX Service Console) was put in the wrong IP range. It is in the same IP range that the VMs are using. We are discussing this with our network team to see what they can do.

I will keep you posted.

Chamon
Commander

Is the service console default gateway correct?

kaychan
Contributor

The service console's network configuration in virtual center is correct but it seems that physically it is connected to the VLAN used by the VMs, not its own VLAN.

RaniBaki
Contributor

I am having exactly the same problem. Unfortunately I can't use a different VLAN for my VM network because we're using blades. The whole enclosure resides on the same VLAN. I sure hope this is not the reason. Please let me know if you make any progress on this. I'm probably going to open a case with VMware.

Chamon
Commander

So that sounds like the trunking on the port(s) that the Service Console NIC is plugged into is not correct.

kaychan
Contributor

We have also opened a case with VMware but the technical support does not seem to have any idea. He seems to have no knowledge on the virtual network side. We have been getting more support and knowledge from this forum than from them.

Chamon
Commander

Can you request to have the SC NIC moved to a different port on the physical switch and have that port configured with the proper VLAN? Maybe the current port on the physical switch is bad or is shutting down after a short time.

RaniBaki
Contributor

I am going to move the sc to a different vlan tomorrow. I will let you know the outcome.

RaniBaki
Contributor

I moved the SC to a different VLAN and it still did not work. After 30 seconds, the ESX server still disconnects. I wish someone from VMware would shed some light on this.

kaychan
Contributor

Sounds like you have the same problem that we have. Our network team still has not reconfigured the SC vmnic0 for us, so it is still on the same network as the other vmnics allocated to VMs.

Please let me know what VMware support says about your problem. I logged this case with VMware, and unfortunately the technical support person has no networking knowledge and has no idea what the cause of the problem is. Even when I showed him our network misconfiguration, I had to explain it twice before he understood. Then he just said that our concern is legitimate, because the misconfiguration of vmnic0 could have caused a routing problem. I have not heard any further suggestions from him. I am rather disappointed with their technical support.

dcurtis
Contributor

I had the same problem. In the end we found it was because the 'networking' guys had opened all the correct ports, but for TCP only. Opening them for UDP as well fixed it.
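
A TCP test such as telnet will not catch this, because it never sends a UDP packet. One rough way to check UDP through a firewall is to listen on one side and send a datagram from the other. The sketch below assumes Python 3 on a machine on each side of the firewall; the function names are made up for illustration, and the port is whichever one you want to test. Note that 902/udp on the VC server may already be in use by the VC service, so the listener might have to run on a test machine in the same segment, and firewall rules tied to specific addresses may treat that test machine differently.

```python
import socket


def open_udp_listener(bind_ip, port):
    """Bind a UDP socket on the receiving side and return it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_ip, port))
    return sock


def wait_for_datagram(sock, timeout=30.0):
    """Wait for one datagram; return (payload, sender) or None on timeout."""
    sock.settimeout(timeout)
    try:
        payload, sender = sock.recvfrom(1024)
    except socket.timeout:
        return None
    return payload, sender


def send_datagram(dest_ip, port, payload=b"udp-test"):
    """Send a single UDP datagram from the sending side."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (dest_ip, port))


# Hypothetical usage:
#   receiving side:  sock = open_udp_listener("0.0.0.0", 902)
#                    print(wait_for_datagram(sock))
#   sending side:    send_datagram("ip_vcenter_server", 902)
```

If the listener times out while the sender reports no error, the datagram was most likely dropped along the way, since UDP gives the sender no confirmation of delivery.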

Hope this helps
