I have been experiencing random VMs hanging during the "Removing Snapshot" phase of VSS-quiesced vSphere Replication.
Note: I have an open case with VMware support (since August 14th, 2014) and with Microsoft, and I am actively working on it with both; however, I want to share my current predicament and see whether others are having similar experiences that could help lead to a resolution.
Brand new environment with vSphere Enterprise 5.5u2, SRM and vSphere Replication
New Dell R720 servers, iSCSI SAN-attached to Dell EqualLogic PS6210 SSD and NL-SAS SANs (using the EqualLogic MEM plugin).
10GbE Cisco Nexus 5548 networking (including 10GbE intersite links).
~18 mostly W2K8 R2 VMs are replicating with vSphere Replication and Volume Shadow Copy Service (VSS) quiescing.
Random W2K8 R2 VMs running on different ESXi servers have been occasionally hanging during snapshot consolidation (the vSphere Client shows "Removing Snapshot XX%" but it does not proceed).
The vSphere Replication process triggers a quiesced snapshot; then, when the snapshot is being removed, or attempting to be removed, the VM hangs in a very bad way.
The VM is no longer pingable, and cannot be rebooted (basically you cannot do anything with it because it is stuck in the removing snapshot state).
With VMware support, we attempted to hard kill the VM process from the command line, but no luck.
We tried to reboot the ESXi server (after moving all other VMs off), but the ESXi server itself will not even reboot; the hang is so bad that the host cannot reboot cleanly.
Eventually, I have to reset the ESXi server; once the system comes back up, I consolidate the snapshot on the problematic VM and power that VM back on.
VMware has looked over my system and verified that everything is functioning properly with the infrastructure: SAN connectivity and the like are all good, and all software versions are up to date.
The logs show very little related to the hung VM. We see in the logs that the snapshot consolidation was started; the next thing you see is that the VMware Tools heartbeat messages are not being received (i.e., the VM hung).
We usually detect this issue in two ways:
- vSphere Replication RPO violations
- detecting that the problematic VM is no longer reachable
To verify the problem, I log in with the vSphere Client directly to the ESXi server hosting the problematic VM. In the Recent Tasks pane, there will be an in-progress "Removing Snapshot XX%" task that never completes. Additionally, the vmware.log in the VM's directory will have its last log message at the same time as the Removing Snapshot task started.
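As a rough sketch of that last check, something like the following could flag a VM whose vmware.log went silent when the snapshot removal started (the log line format and timestamps here are assumptions for illustration, not taken from a real vmware.log):

```python
from datetime import datetime

def last_log_time(lines):
    """Timestamp of the last vmware.log entry.

    Assumes each entry begins with an ISO-8601-style timestamp such as
    2014-09-15T10:25:00.000Z followed by '|' (the real format may differ).
    """
    last = None
    for line in lines:
        stamp = line.split("|", 1)[0].strip().rstrip("Z").split(".")[0]
        try:
            last = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # skip continuation lines without a timestamp
    return last

def hung_since_snapshot(log_lines, removal_started):
    """True if logging stopped at (or before) the Removing Snapshot task start."""
    last = last_log_time(log_lines)
    return last is not None and last <= removal_started
```

Pointing this at a copy of the VM's vmware.log, together with the Removing Snapshot task's start time from Recent Tasks, reproduces the manual check described above.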
Is anyone else experiencing this issue?
If so, what is your setup (i.e., servers, SANs, vSphere version, networking, etc.)?
We are experiencing the exact same behavior with the exact same version of VMware vSphere and SRM. We have found that the only way to prevent the hanging is to disable quiescing for replication. We have opened multiple tickets with VMware, only to have them unable to solve it and, in fact, point to hardware as the source of the problem; we are running fast servers and a new 3PAR, and the guest in question at the moment is a very low-IOPS file server.
So, it is good to know we are not the only ones. What is not good is that we need to use quiescing, and that it is a fully supported function of the application in this configuration (hence the plainly visible checkbox), yet it is intermittently causing complete downtime for certain mission-critical services.
I am still actively working on this issue, and my current escalation engineer seems to be making some progress.
I have given him logs of the issue from several occasions on different VMs. He has found a pattern, and an incorrect VM state that he is now investigating.
Here is a quote from the support case:
"Here is an example of non-graceful cycle where the VR snapshot is created. But its removal never gets completed. The state of the VM goes from:
"VM_STATE_REMOVE_SNAPSHO" to "VM_STATE_CREATE_SCREENSHOT"
which is not the expected transition. In other words the final successful transition should be
"VM_STATE_REMOVE_SNAPSHO" to "VM_STATE_ON" instead."
So, he is investigating the invalid state: "VM_STATE_CREATE_SCREENSHOT"
If you want to give me your case number, I can pass it on to my engineer, and see if he can include it in his investigation. Could help both of us.
I'm working with Wizardberry and here is our case number for the issue: 4739592734
Hope it'll help to find a solution. We were having this issue prior to 5.5U2, and thought SRM 5.8 would have ironed some of it out; however, that wasn't the case.
I am from VMware Global Support Services and I am working with Juice14 on this issue. You have mentioned ticket number "4739592734". Is that a VMware support service request number? If so, I have searched for it and it appears that it is not a valid VMware support request number. Could you please verify that number and provide it again? I would like to take a closer look at the ESX logs from your environment to see if we have the same failure patterns.
Do you have a VMware Support Request number in which you had reported this issue? If so, please share it with us here so I can look at the logs from your environment as well.
I looked at your ticket as well as the ESXi logs from your server, and I see some differences between your case and Juice14's.
1. Snapshot creation for one VM failed because it had 255 snapshots already and that is the upper limit.
2. Most other snapshot creation tasks that fail are failing with "Failed to quiesce the virtual machine".
3. You are using third-party backup tools to back up your VMs, which also generate snapshots. There are known compatibility issues between VR and third-party backup tools when quiescing is enabled on both. Only one of them should be allowed to quiesce a VM during snapshot creation and consolidation. See KB# 2040754 (http://kb.vmware.com/kb/2040754) for details on this.
4. Also keep in mind that dynamic disks within the guest OS are not supported with application quiescing and VSS.
5. One other thing: I am not seeing unusual state transitions in your logs the way they show up in Juice14's case.
So in your case, you have to keep VR (VSS) quiescing disabled to avoid problems with snapshot consolidation and VM responsiveness during these operations.
I'm not sure if the references you are mentioning are for our environment or the other one. Either way:
1- According to the article you are referencing, the maximum number of snapshots is 32, not 255.
2- Right, and that would explain it if the upper limit is reached, right?
3- There is no other option but third-party tools for backup, unless VMware is doing backup tools now. Nonetheless, backups occur at night, and we are experiencing issues during the day. Once again, we didn't miss a backup on these particular VMs.
4- We are not using dynamic disks in our case.
5- Maybe... still, similar issues. We need to find Waldo.
I'm not sure we have snapshot consolidation issues, as usually when that is the case we get a yellow warning stating that we need to consolidate the disks, and I don't recall seeing it when the freezes occur.
I've had a similar issue. My case number is 14519767208. We were at vCenter 5.1.x and VR 5.1.x.x. The issue happened equally on hosts that are on Cisco UCS and IBM BladeCenter, using EMC VMAX SAN. Disabling quiescing lessens the impact, but the replicated VMs are still affected at least slightly every time a sync happens. To test this, I start a continuous ping to a non-replicated VM on HostA in VLAN1 and another to a replicated VM on HostA in VLAN1, where both VMs have the same OS and VMXNet3 NICs. I let it run for a while, and every time I try this, the replicated VM will have 1%-2% dropped pings, while the non-replicated one never has any. The ping drops always coincide with the freezes within the VM's OS, which coincide with the synchronizations.
I had two production VMs running SQL which were being replicated, but I had to stop using VR a while back because the issue was severely impacting users. I'm afraid to try this again on any production servers due to the anger it caused. Ironically, one of those two VMs is named "Waldo"!
We just upgraded to vCenter 5.5.0 2175560 and VR 188.8.131.52. I haven't had a chance to test using this version yet because of a different issue, which I came here to post.
I think there is a pattern here.
I am actively working with Dell and VMware on a very similar issue.
Our main file server (shares, etc.) has been replicating to our DR site without issues, or so I thought. Last Monday the machine crashed around four times under moderate load. Initially it was thought there was a storage issue with the Dell EqualLogic, but the other VMs seemed fine, even the ones replicating to the DR site. On closer inspection, we have been running on snapshots (55 of them) since the replication went wrong (my presumption).
This weekend has been a nightmare. We have tried everything to consolidate the disks, but I am now worse off than before VMware and Dell got involved, because the VM only works for about 20 minutes and then needs to be powered off and on.
I am currently trying to safeguard data and get everything off the VM, as you would expect. I can't believe I'm in this position, actually, but it does look like there are some issues with replication in 5.5 that need some serious looking at.
I will post back here once our infrastructure is safe.....
Similar issue here. After upgrading from 5.1 to 5.5u1, quiesced snapshots for backup via IBM TSM for VE started randomly hanging some W2K8 R2 VMs during snapshot creation or removal (the task is stuck forever); each time, only an ESXi VM process kill restored the VM. Disabling VSS fixes the problem, but that cannot be the solution.
Tried upgrade of both vcenter and esxi to 5.5u2 but issue is still present.
My SR to VMware was not useful.
My finding is that it's not as bad with quiescing off, but it still happens. For us, when the VM hangs, it does so to the point where it drops off the network. I would say to anyone who is seeing the issue: do a continuous ping of a replicated VM, and at the same time ping a non-replicated VM on the same host and in the same subnet. Let them run for maybe an hour and compare the number of dropped pings after you stop each one. In our case, the replicated machine always has at least a few drops, while the machine that isn't replicated has none.
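For anyone who wants to script that comparison rather than eyeballing two ping windows, here is a minimal sketch. The hostnames and counts are placeholders, and the ping flags are the Linux ones (`-c`/`-W`); adjust them for your platform:

```python
import subprocess

def loss_pct(results):
    """Percentage of dropped pings, given a list of booleans (True = reply)."""
    if not results:
        return 0.0
    return 100.0 * results.count(False) / len(results)

def ping_once(host, timeout_s=1):
    """Send a single ICMP echo via the system ping; True if a reply came back."""
    rc = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode
    return rc == 0

def compare_hosts(replicated, control, count=100):
    """Ping a replicated VM and a non-replicated control VM, return loss % for each."""
    rep = [ping_once(replicated) for _ in range(count)]
    ctl = [ping_once(control) for _ in range(count)]
    return loss_pct(rep), loss_pct(ctl)
```

Running `compare_hosts("replicated-vm", "control-vm")` during a replication sync window versus outside of one should make the pattern described above (a few drops on the replicated VM, none on the control) easy to quantify.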