VMware Cloud Community
ncarty97
Contributor

Losing all network connectivity on my host and VMs - ESXi 5.1

Hello,

Appreciate anyone that can help point me in the right direction to get this solved.  I keep having a problem where my whole system (host and VMs) loses all connection to the network.  If I look at the monitor connected to the host, everything appears fine (or rather, there is no change from when it finished booting).  I've switched out both routers and switches, so it's not an issue there.  I can't identify any particular program or activity that would be causing it; it seems almost random (I can go multiple weeks without it happening, then have it two or three times in one day).  I've attached the log, but I'm not even sure it's the right log.  It's pretty huge and I have no idea what I should be looking for.  Any help would be greatly appreciated!

Thanks

Some information on the system:

CPU:  AMD FX-8320

Motherboard:  ASRock 970 Extreme4

Memory:  DDR3 PC3-10700H 16GB (8GBx2)

NIC:  Onboard Realtek NIC disabled, using an Intel Dual NIC (Was having this problem when just using the onboard NIC as well)

VMs:

1.  WHS 2011 running FlexRAID (passing through an LSI Logic card for the drives)

2.  Windows 7 64-bit, mainly running Emby Server for media streaming

10 Replies
ncarty97
Contributor

Nothing?

Seriously?

ncarty97
Contributor

One more try.  Surely in the vast community of people here, SOMEONE has an idea of what to check!?!

jgotway01
Contributor

I had the same symptoms a while back with a switch that was going out. It would work for a while and then cause a random loss of connection to all VMs on the host. It was actually a switch supporting an iSCSI storage network. Since you have already replaced the switches, it may be something related to switching other than the hardware. My first question to you would be: are you using a shared storage network? If so, is it iSCSI?

unsichtbare
Expert

I suggest disabling (or removing from the vSwitch) one of the two ports (vmnics) at a time.

First remove vmnic0 and then test things. Then remove vmnic1 and test things. If removing either vmnic improves the situation, it is your physical switchports that are misconfigured. Be prepared to use esxcli on the console/shell to remove/add physical NICs, in case you get disconnected.

VMware KB: Configuring vSwitch or vNetwork Distributed Switch from the command line in ESXi/ESX

ESXi's default failover detection policy is "link status only", so if a vmnic is connected but the port it is connected to is not working correctly, you may see loss of connectivity. Moreover, ESXi's default load balancing policy, "route based on originating port ID", exacerbates apparent loss of connectivity because it routes each virtual NIC (including the vmkernel) over one vmnic and one vmnic only. BTW: "originating port ID" works great when the physical switchports are correctly configured.
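
In case you do get disconnected, this is roughly what it looks like from the ESXi shell; vSwitch0 and the vmnic numbers below are just placeholders for whatever names your host actually uses:

    # remove one physical uplink from the standard vSwitch, then test connectivity
    esxcli network vswitch standard uplink remove --uplink-name=vmnic0 --vswitch-name=vSwitch0
    # add it back and pull the other uplink for the second test
    esxcli network vswitch standard uplink add --uplink-name=vmnic0 --vswitch-name=vSwitch0
    esxcli network vswitch standard uplink remove --uplink-name=vmnic1 --vswitch-name=vSwitch0
    # show the current failover and load balancing policy for the vSwitch
    esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0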

+The Invisible Admin+ If you find me useful, follow my blog: http://johnborhek.com/
ncarty97
Contributor

I assume by shared storage you mean sharing a virtual disk where two or more VMs have direct access?  If so, I'm not doing that.  The only place I am sharing storage is that my WHS2011 VM has a passed-through IBM M1015 card (flashed per the instructions here: LSI Controller FW updates IR/IT modes) that houses an array (8 drives, using DrivePool for pooling and FlexRAID for parity RAID).  I share that drive using standard Windows sharing, not anything through vSphere.  The controller I see for this VM says "SCSI controller 0" with a summary of "LSI Logic SAS".  Is that where I would look for iSCSI?  I've poked around but I don't see anywhere I could set it as iSCSI.

And thanks for your help, I was really getting frustrated!

ncarty97
Contributor

Thanks, I will give this a try as well!

jgotway01
Contributor

OK. If you were using shared storage via iSCSI over Ethernet, your vSwitches in vSphere would have to be set for jumbo frames at 9000 MTU as opposed to the default 1500 MTU. I have seen a host lose connectivity because that was improperly configured. But it appears that you are not using a vSwitch in that manner. Are you doing any disk-intensive activity on the network when the problem occurs?
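
For reference, if you ever did go the iSCSI route, this is roughly how you would check and set jumbo frames from the ESXi shell; vSwitch1 and vmk1 are placeholder names, not anything from your current setup:

    # check the current MTU on the vSwitch
    esxcli network vswitch standard list --vswitch-name=vSwitch1
    # raise the vSwitch and the vmkernel port to 9000 MTU
    esxcli network vswitch standard set --mtu=9000 --vswitch-name=vSwitch1
    esxcli network ip interface set --mtu=9000 --interface-name=vmk1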

ncarty97
Contributor

The only thing I can think of would be that it does seem to occur more often when I am using one VM to encode video with HandBrake.  The videos are on the WHS2011 VM and I share that folder using regular Windows sharing.

But it's not like that is the only time it occurs.  It happens plenty of times when I'm not doing anything at all (i.e., in the middle of the night).  FlexRAID updates in the middle of the night, but that's usually the only overnight process going on.  I've also streamed video via Emby to three different devices before and not had a problem at all.

jgotway01
Contributor

Have you considered separating your storage traffic? Use one NIC for your regular network traffic and one NIC for your storage traffic. If your array supports iSCSI, that would be a big help. Then you would only have to set up a software iSCSI adapter with a new vSwitch in vSphere, and possibly a new physical switch to isolate the networks. I would avoid converging your storage traffic and regular traffic; it might be the cause of your issues.
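
Just as a rough sketch of what that would involve on the ESXi side (vSwitch1, the "iSCSI" portgroup, vmk1 and the IP address are all placeholder names, not from your setup):

    # create a separate standard vSwitch and portgroup for storage traffic
    esxcli network vswitch standard add --vswitch-name=vSwitch1
    esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch1
    esxcli network vswitch standard portgroup add --portgroup-name=iSCSI --vswitch-name=vSwitch1
    # add a vmkernel port on that portgroup for the storage network
    esxcli network ip interface add --interface-name=vmk1 --portgroup-name=iSCSI
    esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=192.168.50.10 --netmask=255.255.255.0 --type=static
    # enable the software iSCSI adapter
    esxcli iscsi software set --enabled=true
    esxcli iscsi software get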

ncarty97
Contributor

Well, I finally figured it out.  I probably could have found another way to do it, but here's what it was:

Either the CPU or the motherboard is bad.  It's specific to a couple of the SATA ports on the motherboard; the others aren't giving any problems.  I had the main SSD I was using for my VMs on one of those ports.  I took a really roundabout way to determine it.  First, just to make sure it wasn't the ESXi software, I actually removed my ESXi flash drive, put a new SSD in, and loaded Windows 10.  In Windows 10, I enabled the Hyper-V role and then converted all my VMs.  I added some new drives (was planning to anyway) so each VM has its own SSD.  At first it was fine, but after a week I started having trouble again.  I added another SSD and moved a VM to it; loading that VM caused a near-instant lockup.  I swapped it back to another drive and the problem was less severe.

So, I actually had a backup CPU and motherboard pair that I was using in a desktop.  I swapped them in and hooked everything back up.  Everything seems to be running fine now.  I'm not sure whether it was the CPU or the motherboard, but it was definitely one of the two.
