VMware Cloud Community
f1racr
Contributor

Loss of connectivity to ESX host and VM network during backup

Hi all,

I have been dealing with this issue recently and initially thought it was down to a bad configuration on my part, since this is my first attempt at a VI 3.5 implementation. But the more time I spend on it, the more convinced I am that it's something oddball, so I'm hoping you might be able to point me in the right direction.

OK, I have 2 x Intel dual-Xeon 3.0 GHz servers, each with 8 GB RAM, local mirrored 36 GB drives to run ESX on bare metal, and an Infortrend S12E - R1132-34 iSCSI SAN attached to both servers.

I am only running a total of about 8 VMs: 6 Windows Server 2003 and 2 Windows XP.

Now the odd part. Everything works fine during the day. I have HA and DRS enabled with a single resource pool called Production, with all of the VMs inside that one pool and both of my ESX servers in the same cluster. During the day the VMs move around as needed, and I can put either server in maintenance mode and all of its VMs move to the other server successfully, so from that I can only assume that side of things is working as it should.

I have 3 dual-port pNICs in each server, with only one port used on each at this point (waiting on a larger switch):

pNIC 0: VM network and service console

pNIC 2: iSCSI and service console

pNIC 4: VMotion

This setup is the same on both servers. Just FYI, I recently had iSCSI running on the same subnet as the VM network and service console for ease of initial testing and had this issue, so I moved the iSCSI traffic off to a different subnet with its own service console. It made no difference to the issue, but at least it's set up a bit more realistically now.

OK, so this is the real issue. It seems to have only come about since I enabled HA and DRS, but I can't be sure. When I do a backup of the VMs each night via our backup server (Backup Exec 12.5), it fails while backing up the main file server VM. It gets a variable amount of the way through the backup (sometimes 1%, other times 40%) and then loses connection to the VM. When I connect to vCenter and have a look, the resource pool shows my first ESX server as 'not responding' and the VMs that were running on it as disconnected!

I can connect to the console of the 'not responding' ESX server and tell it to restart, but it starts shutting down and appears to hang during the shutdown process. Also, if I try to connect to the server with the VI Client, it says it's connecting and loading the inventory, and then times out.

The odd part is that if I put the file server VM on the second ESX server, things seem to go OK and I can get a full backup of all of the VMs. Weird. So it does seem related to the first ESX server in some way, but I'm lost as to how and why.

I might try removing the server from the resource pool and letting it run to see if that makes a difference, but it's taken me until now to figure out that it only seems to happen when the file server is on the first ESX server, so I've not tried too many different iterations.

All NICs are the same Intel 82546EB PCI devices.

Any ideas as to what I should be looking at? Anyone seen anything like this before?

Just a couple of notes. Once I get the ESX host up and running again, things are all good, so nothing is permanently broken. It's just annoying, as it seems HA won't restart the VMs on the other ESX host because the original host is still responding in some way, just not fully operational. I've used Update Manager via vCenter to make sure the ESX hosts are all up to date and was hopeful that would fix it, but nope :(

Any help is much appreciated.

20 Replies
kjb007
Immortal

By backing up, do you mean you have Backup Exec inside the VM itself, or a separate backup server, or through VCB?

Backing up is typically heavy I/O over the network, and your service console is used for communication with vCenter. If you only have the 3 NICs below to work with, I would separate iSCSI and the VM network, and put VMotion and the service console together. That way you shouldn't interfere with management functions when you're utilizing the network heavily. But this still should not be enough traffic to fill a gigabit network. Have you checked your network settings to make sure you're connecting at 1 Gb full duplex? I'll assume you have a GbE network. If yes, then I would try to force each physical NIC to 1 Gb full duplex from both the ESX side and the switch side.
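If it helps, on the ESX 3.x service console the speed/duplex check and override can be done with `esxcfg-nics`. A minimal sketch, where `vmnic0` is just an example name; substitute the NIC you see in your own listing, and make the matching change on the physical switch port:

```shell
# List physical NICs with their driver, current link speed and duplex
esxcfg-nics -l

# Force a NIC to 1000 Mb/s full duplex
esxcfg-nics -s 1000 -d full vmnic0

# Or revert it to auto-negotiation
esxcfg-nics -a vmnic0
```

Mismatched speed/duplex between the host NIC and the switch port is a classic cause of heavy-load-only failures, so it's worth ruling out first.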

-KjB

VMware vExpert

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
f1racr
Contributor

Sorry, I should have been clearer.

I am using a separate physical server running Backup Exec to connect to my VMs to back up the data. I have a separate process that runs a VCB backup of the entire VMs on a weekly basis, but the VCB side seems to be working OK.

kjb007
Immortal

OK, is your VCB server mounting using nbd mode, or connecting via iSCSI directly to the target to mount the disk files? I still suggest separating your service console and your VM networks, for security if nothing else. I'd check speed and duplex next.

-KjB

f1racr
Contributor

VCB is using nbd. I tried to get the iSCSI method going but couldn't, and didn't have time to investigate further, so I set it to nbd mode in the meantime.

The issue, though, is with my physical backup server talking to my VMs rather than with VCB, but I assume you're checking whether VCB was in fact doing a similar thing.

It seems to be load related, as everything is perfect under normal load during the day, but I think when the heavy load comes on during the backup it can't cope for whatever reason.

I'll have a look at duplex and such, but I'm not convinced it's related to that.

I'm going to remove it from the resource pool tonight, but leave the file server and a couple of others on that ESX host, and see if it craps out again or succeeds when not in the resource pool. Hopefully that might help narrow things down.

Thanks heaps for your replies so far.

f1racr
Contributor

Well just wanted to update this post for those who are searching for similar issues.

When I was searching, I found very little information regarding this or similar issues.

It might be a bit early to say I have definitely fixed the issue, but things are working well now.

First I removed the ESX host that was "not responding" each night from the cluster and ran it as a stand-alone host; it seemed to work well and the backup succeeded. Putting the ESX host back into the cluster saw the issue return, with the ESX host and any VMs running on it basically ceasing to function.

At the same time as we were looking into this, the physical server running the backup software was having issues with its network card. I have replaced both of the network cards in that server, left the ESX host in the cluster, and the backup succeeded last night without hassles. So it's looking like the NICs in the physical backup server were somehow causing enough of an issue with the ESX host to trigger some form of crash, which caused the "not responding" / disconnect from vCenter. Weird, I know, but it certainly seems to have made things better: the backups are also now completing in approximately half the time they took before, so I think it's clear there was an issue with the NICs in the backup server.

I'll post back here again after we've had a good week of reliable operation, as that will be the final proof that the NICs were the issue.

So on the surface of it, even the NICs in a separate server can cause issues on an ESX host. I had ruled out the possibility of another physical server's issues somehow crashing an ESX host, but it seems it's possible.

kjb007
Immortal

Did the ESX host actually go down, or was it just not responding? If that NIC was flooding the ESX NIC, it could have caused an issue with the ESX response back to vCenter. Glad you found some sort of resolution, even a temporary one for now.

-KjB

f1racr
Contributor

As mentioned in the original post, the ESX host was showing as 'not responding' in vCenter, and any VMs running on it were down and also showing as not available. However, I could still connect to the console via SSH and request a restart of the server, which it responded to, and when the ESX host came back up all was well again: VMs restarted and the ESX host started to respond. It worked perfectly through the day with its daily load, but when the backup came around at midnight the same thing happened again!

Weird I know, but true.

There didn't seem to be any flooding of the switch after the disconnect from the backup when I looked at the logs.

f1racr
Contributor

Well, it's been a few days now, and removing the old Gig network cards from my physical backup server seems to have solved the issue of one of my ESX hosts becoming unresponsive to the vCenter server and the VMs running on that host becoming unavailable.

I've now had 4 full, clean file-level backups of all of my VMs using Backup Exec, and I've also conducted a full VCB backup of all VMs with no hassles. Seems like problem solved.

f1racr
Contributor

Well, it seems I might have spoken a little too soon :(

Woke up this morning to the same issue. Bugger. I have taken a screenshot to show what happens when you look at vCenter, in case it makes sense to someone. It only seems to happen to the 192.168.4.30 server rather than .31, which I suppose points to something, but they are both Intel servers with 3 x dual-port Intel PRO/1000 NICs, so identical hardware. I suppose it's possible that the pNIC in the server could be faulty. I'm pretty sure I've already swapped it for one of the other pNICs, but I can't be sure, so I suppose I need to do it again to test properly.

Any ideas on this would be much appreciated, as it ran faultlessly for over a week, then bam!

http://i87.photobucket.com/albums/k123/f1racrnz/Other/PSL.jpg

kjb007
Immortal

What about the processors? Have you ever replaced a processor in this system? Can you run 'cat /proc/vmware/cpuinfo' on each host and make sure the family/model/type/stepping all match up?

-KjB

f1racr
Contributor

Server 1 (192.168.4.30):

pcpu      00 01 02 03
family    15 15 15 15
model     02 02 02 02
type      00 00 00 00
stepping  09 09 09 09

Server 2 (192.168.4.31):

pcpu      00 01 02 03
family    15 15 15 15
model     02 02 02 02
type      00 00 00 00
stepping  05 05 05 05

So they are slightly different in stepping, but as mentioned, VMotion works perfectly, and both servers had been working perfectly for a week with no hassles. I'm testing now, but it seems the server works perfectly when not in a cluster / resource pool. I won't be able to confirm that for a week.

kjb007
Immortal

As long as they match within the server, you should be okay. Prior to the errors, were there any VMotions?

-KjB

f1racr
Contributor

Well, this is where I think I'm starting to track things down. It's not so much VMotion as the placement of specific VMs on specific ESX hosts. I can't be 100% sure, but from memory, every time the "File/Print/DC" VM moves onto the 192.168.4.30 ESX host it seems to cause trouble.

The Exchange and File/Print/DC VMs moved from 192.168.4.31 to 192.168.4.30 at 3.30pm the afternoon before the issue reappeared. Prior to that, when they were both running on 192.168.4.31, things seemed to succeed with no hassles. Strange, huh?

Is it possible that something in that specific VM is causing the issue? They were all built from scratch and weren't imported; it was a brand new build, and all of the other servers (5) are made from the same template. I guess it's possible something could have been corrupted, but that wouldn't explain why it seems (yet to be 100% confirmed) to work faultlessly on the other ESX host :(

Thanks so much for your help in this matter. It seems no one else is following this, so I really appreciate any insight you might have.

Fraser

kjb007
Immortal

Can you try one more thing and move your VM network onto the same pNIC as VMotion? When backups happen over NBD, they talk over the service console interface, and ESX may not like the additional load.
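A sketch of what that change might look like from the ESX 3.x service console using `esxcfg-vswitch`. The vSwitch and vmnic names below are only examples based on this thread's layout, not confirmed against your config, so check `esxcfg-vswitch -l` first:

```shell
# Show current vSwitches, their port groups, and uplink NICs
esxcfg-vswitch -l

# Unlink the VM network vSwitch's current uplink
# (example names: vmnic0 on vSwitch0)
esxcfg-vswitch -U vmnic0 vSwitch0

# Link the VM network vSwitch to the pNIC currently carrying VMotion
# (example name: vmnic4)
esxcfg-vswitch -L vmnic4 vSwitch0
```

Doing the unlink/relink over SSH rather than the VI Client avoids losing your own management session mid-change if the service console shares the affected vSwitch.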

-KjB

f1racr
Contributor

OK I will give that a go.

Should I put the server back into the cluster to try this, or should I leave it outside? It seems to work OK when the server isn't in the cluster, so I suppose I need to put it under the same pressure that made it fail, to be sure we're resolving the issue.

kjb007
Immortal

Right, leave it in the cluster.

-KjB

f1racr
Contributor

Well, I know I've said this before, but I really do feel much more confident that the issue is resolved this time.

Even though I have 6 separate pNIC ports in the machine, I don't have a large enough switch to run everything through separate cards. So instead of separating the VM network and the service console onto separate NICs, I just rate limited the VM network to 95% instead of 100%, and it solved the issue instantly. I've had over a week of successful backups, and the VCBs as well, all faultless.

My new switch will be here soon, and I will put the VM network on its own pNIC to do it properly.

So the moral of the story is: always make sure you separate the VM network, VMotion, iSCSI and service console traffic, and if you can't, rate limit it a little bit to allow the heartbeats to keep running during heavy backup load!

Thank you sooooo much kjb007, you've helped point me in the right direction on more than one occasion :D

kjb007
Immortal

You're very welcome.

Don't forget to leave points for helpful / correct posts. ;)

-KjB

f1racr
Contributor

Alas, once again I spoke too soon :(

Same issue again this morning, but this time I had a message on the console relating to loss of connection to the iSCSI controller, and it was struggling to contact its alternate controller as well. So it seems it's either something to do with the SAN itself (although the other server connected to the same SAN stays running, so probably not) or the iSCSI side of the ESX host falling over.

I'm going to have to do some more investigation.

Is there any way to restart the iSCSI side of things without a full reboot?
