gasdotit
Contributor

All the ESXi servers in our farm get stuck at the same time

Hello everybody,

We're facing a really weird issue with our ESXi infrastructure (3 Dell servers hosting a total of 10 VMs: 9 Linux boxes and one W2003 R2 machine).

The whole setup had been working perfectly for more than a year until last week, when we upgraded the machines to ESXi 4.1.

Since then the W2003 R2 server (which we access through Remote Desktop) randomly freezes completely. It usually becomes available again after 25-30 minutes.

What we experienced is that during those hangs:

  • we cannot access ANY ESXi server in our rack via the vSphere Client (the connection can't be established, or it freezes on "Loading inventory"). Not only the one hosting the W2003 server: all of them are unreachable, even though we can ping them :(

  • we cannot SSH to ANY of them on the management interface (usually this works). Same as above :(

  • the Linux boxes seem to be up and running without any issue.
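For anyone trying to pin down a hang like this, it may help to log which management services answer while ping still works. The following is only a minimal sketch: the host addresses are placeholders, and the port list (22 for SSH, 443 and 902 for the vSphere Client) is an assumption about a typical ESXi setup that should be adjusted to your environment.

```python
import socket
import time

# Hypothetical management addresses; replace with your own ESXi hosts.
HOSTS = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
# Assumed service ports; verify them against your own configuration.
PORTS = {"ssh": 22, "https": 443, "authd": 902}


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def probe_once(hosts, ports, timeout=3.0):
    """Return {(host, service_name): bool} for a single probing round."""
    return {
        (host, name): tcp_reachable(host, port, timeout)
        for host in hosts
        for name, port in ports.items()
    }


def log_rounds(rounds, interval=60.0):
    """Print one timestamped status line per host and service, `rounds` times."""
    for i in range(rounds):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for (host, name), ok in sorted(probe_once(HOSTS, PORTS).items()):
            print(f"{stamp} {host} {name} {'up' if ok else 'DOWN'}")
        if i < rounds - 1:
            time.sleep(interval)
```

Calling `log_rounds(60)` would record roughly an hour of samples. Running it from more than one machine (for example one on the same LAN as the hosts and one that reaches them remotely) shows whether the hosts themselves stop answering or only the path to them does.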

Please note we even reverted the ESXi machine hosting the W2003 VM (Dell PE1950 with 16 GB of RAM) to v4.0 Update 1 (as it was before the upgrade), but this didn't solve the problem.

VMware Tools is installed and up to date on every VM, and we cannot see any error messages in the event logs from the vSphere Client.

We're now digging into Windows issues, but what we find very strange is that the problem hits all the servers at once.

Thank you very much for any hint or help you could provide.

Marco

4 Replies
FranckRookie
Leadership

Hi Marco,

Welcome to the forums.

It would be very bad luck to have a failure on all hosts at the same time! It looks more like a problem with your network infrastructure. If I were you, I would inspect all my network devices and take a close look at their logs when the problem occurs.

Good luck!

Regards

Franck

jpdicicco
Hot Shot

I have seen similar behaviour in a cluster after a LUN went missing. All hosts in the cluster looked for that LUN very hard every 30 minutes, dropping all network I/O while they searched. Of course, the LUN was unavailable since it had lost a disk and was RAID 0 :-(.

Check your storage, and good luck.



Happy virtualizing!

JP

Please consider awarding points to helpful or correct replies.

gasdotit
Contributor

Hello Franck, hello JP,

I'm sorry for replying so late, but I've spent a lot of time investigating the problem these days (and nights).

Thank you for your helpful advice. We're still digging through the logs and checking the iSCSI configuration...

Unfortunately we still haven't found the cause...

We tried disabling Sparse LUN support, we checked the VT settings in the BIOS of each server, and we checked the network cabling and switches... but nothing :(.

I hope to be able to add some positive details to this thread as soon as possible!

Bye,

Marco

gasdotit
Contributor

Hello again,

After many days and nights spent testing, reinstalling, and swapping network cables and switch ports... we finally found the issue!

One of our virtual machines is a Linux box running Suse 11.2, which acts as an OpenVPN / PPTP server (and we access all the other servers I mentioned in my original post THROUGH this VPN!).

Well, it seems that VMware Tools causes some kind of networking issue when combined with OpenVPN/PPTP.

As a matter of fact, we uninstalled VMware Tools on that machine and... everything is working properly again!

Meanwhile, we don't have any issue on our other Suse 11.2 machines, which have VMware Tools installed but no OpenVPN/PPTP.
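In hindsight, the clue was that every server looked down at once, but only from where we were connecting. That reasoning can be sketched as a small check: if all targets are unreachable from one vantage point while another vantage point still reaches at least one of them, the shared path is the suspect rather than the targets. The vantage and host names below are purely hypothetical, and the reachability values would have to be collected by hand or with a probe of your choice.

```python
def vantages_with_path_problem(results):
    """Flag vantage points whose path, not the targets, is likely at fault.

    `results` maps a vantage name to {target: reachable?}. A vantage is
    flagged when every target looks down from it while at least one other
    vantage can still reach one of those same targets.
    """
    flagged = []
    for vantage, seen in results.items():
        if not seen or any(seen.values()):
            continue  # no data, or something is still reachable from here
        reachable_elsewhere = any(
            other_seen.get(target, False)
            for other, other_seen in results.items()
            if other != vantage
            for target in seen
        )
        if reachable_elsewhere:
            flagged.append(vantage)
    return flagged


# Hypothetical observations taken during one hang.
observations = {
    "via_vpn": {"esx1": False, "esx2": False, "esx3": False},
    "local_lan": {"esx1": True, "esx2": True, "esx3": True},
}
print(vantages_with_path_problem(observations))
```

With observations like these, the check points at the VPN path, which is where we should have looked first.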

I hope this helps anybody else facing a similar problem (and sorry for the late reply!).

Happy new year!

Marco
