VMware Cloud Community
stfconsulting
Enthusiast

Hosts Disconnect During vMotion

Hi guys, battling a tough problem with a new cluster and could use some feedback. We have been having issues where one of the hosts (seemingly at random) goes into a disconnected state. If you let it sit long enough it will come back by itself, and restarting the vCenter services seems to speed up the process. Last night I went to put one of the hosts into maintenance mode (to apply the latest patches) and all sorts of bad things happened. It got to the point where two hosts became disconnected. The situation got so bad that one of the hosts could not get back into the cluster, and we had to shut machines down and re-register them on the remaining cluster hosts (fun at 3:00 AM). VMware was a little stumped last night with what was happening, so I have to re-engage them next week. Any help / ideas would be much appreciated.

Here is the hardware / details.

-Running ESXi 5.5, build 2302651 (Dell-specific ISO)

-3 x Dell PowerEdge R730 [Boot from Flash] (firmware completely up to date)

-1 x Dell PowerVault MD3420 with 12Gb/s SAS connectivity (dual controller, firmware current)

-There is a Dell PowerEdge R730XD direct-connected to the MD3420, running Windows Server 2012 R2 / Veeam 8.0 Update 1 for backups

-We are not sure if Veeam could be causing this to happen. Trying to get them involved. For now we have Veeam completely disabled.

-A PuTTY session to one of the disconnected hosts, which had locked up during a management agent restart, came back magically when a Veeam replication job was cancelled (it was replicating a machine out of the backup repository, so I have no idea why that would matter). The agent restart commands we have been using are sketched just after this list.

-Currently have DRS automation and Application monitoring disabled to mitigate risk

-Starting to move workloads to another cluster to reduce risk
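
For reference, this is roughly how we have been restarting the management agents on an affected host from the ESXi shell (standard ESXi 5.5 commands; adjust if your build differs):

# Check whether hostd and vpxa are still running on the affected host
/etc/init.d/hostd status
/etc/init.d/vpxa status

# Restart just the two agents vCenter talks to
/etc/init.d/hostd restart
/etc/init.d/vpxa restart

# Or restart all management services in one go (heavier hammer)
services.sh restart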

5 Replies
stfconsulting
Enthusiast

Some more info:

After looking at the logs on one of the hosts that disconnected during the vMotion process, I found this over and over again:

6152: Rank violation threshold reached

Going to contact Dell today to see if this is a known issue with the MD3420 / R730 combo.

I completely shut down the server that is connected to the MD3420 (Veeam) for troubleshooting and I can still reproduce this.

The weird thing is that the machines that got stuck in the vMotion process finally completed about an hour later.
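
If anyone wants to check their own hosts for the same signature, something like this from the ESXi shell should surface it (log path is the ESXi 5.5 default; the exact message text may differ by driver version):

# Search the live vmkernel log for the rank violation messages
grep -i "rank violation" /var/log/vmkernel.log

# The rotated copies are gzipped alongside it, so check those as well
zcat /var/log/vmkernel.*.gz | grep -i "rank violation"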

vThinkBeyondVM
VMware Employee

I would suggest you log a support request with VMware as well.


----------------------------------------------------------------
Thanks & Regards
Vikas, VCP70, MCTS on AD, SCJP6.0, VCF, vSphere with Tanzu specialist.
https://vThinkBeyondVM.com/about
-----------------------------------------------------------------
Disclaimer: Any views or opinions expressed here are strictly my own. I am solely responsible for all content published here. Content published here is not read, reviewed or approved in advance by VMware and does not necessarily represent or reflect the views or opinions of VMware.

stfconsulting
Enthusiast

I have a case open right now. Hopefully tomorrow they will get back to me on the findings from the logs. Going to open a case with Dell today to see if this is something on their end. Thanks!

stfconsulting
Enthusiast

I was able to get in contact with VMware tonight / this morning and spent a few hours working with support. I am able to reproduce the vMotion issue over and over again, and the problem has nothing to do with Veeam because the backup server is completely powered down. If you try to vMotion more than a couple of workloads at once, the host goes into a non-responsive state and the rank violations start spewing out in the logs. VMware wanted all three hosts restarted for troubleshooting purposes. It was very difficult to move stuff around with all the vMotion issues, but I was able to get all three hosts restarted. To simplify matters I disabled HA and DRS on the cluster completely. None of my troubleshooting steps fixed the issue tonight. I continue to move machines off the bad cluster to other clusters to reduce the risk. Hopefully this can get escalated tomorrow and we can work towards getting this resolved.
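
For what it is worth, the easiest way I have found to catch the rank violations while reproducing this is to watch the live vmkernel log during the concurrent vMotions (ESXi shell, default log path):

# Follow the vmkernel log while kicking off several vMotions at once
tail -f /var/log/vmkernel.log | grep -i "rank violation"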

stfconsulting
Enthusiast

Well, after a lot of torture, here is the solution:

-The drivers that Dell bundles with the latest 5.5 U2 ESXi media are no good. You have to use the latest VMware drivers for the SAS3 12Gb/s HBAs (see the driver check / swap sketch after this list).

-After swapping out the driver on two hosts, I tested vMotion and could not reproduce the problem.

-For good measure we updated the firmware on the MD3420 to 8.20, which was mainly new features.

-Updated VMware ESXi to the latest build
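
For anyone in the same spot, this is roughly what checking and swapping the HBA driver looks like from the ESXi shell. The module / VIB names below (lsi_msgpt3 / msgpt3) are what the 12Gb/s SAS3 HBAs typically use, but confirm the actual names on your own host before installing anything, and the offline bundle path is just a placeholder:

# See which driver module each storage adapter is using
esxcli storage core adapter list

# Check the version of the installed driver VIB (name assumed; confirm on your host)
esxcli software vib list | grep -i msgpt3

# Put the host in maintenance mode, then install the VMware-supplied driver bundle
esxcli system maintenanceMode set --enable true
esxcli software vib install -d /vmfs/volumes/datastore1/offline-bundle.zip

# Reboot so the new driver loads, then verify the module afterwards
vmkload_mod -s lsi_msgpt3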

After this experience we are going to make a couple of changes to our deployment strategy, to hopefully flush out a problem like this before a cluster goes into production. We did many tests of vMotion and Storage vMotion before going to prime time, but we did not push it hard enough. Once automated DRS kicked in and tried to move many things at once is when we started to see these issues. If we had put a lot of pressure on the cluster and watched the vmkernel logs, we would have seen the error messages.

Couple all of this with trying to migrate to Veeam at the same time, and it was difficult to isolate. Once Veeam was shut down and the problems still continued, it became much easier to focus our efforts.

Hopefully someone else who encounters this with an LSI-based solution (HP, Dell, IBM) will benefit from this thread.
