Hey Guys,
Got a really odd one here, for the last week I noticed all VM backups (Veeam 9.5) blew out by 35-45mins each. The problem is with 70+ VMs this pushed us outside the backup Window. In terms of troubleshooting so far we haven't got far, I've reviewed the vCenter logs and rebooted EVERYTHING. What we found from this is that when we reboot vCenter we have no issues for about 15 snapshots. The second thing we tried was snapshots from vCenter (rather than Veeam) but we have the same issue and it's only in creation, not removal. No matter how large or small the VM they all take the same time (even a 5GB CentOS Server) roughly 35-45mins. All tha past tasks looking at the vCeneter logs were between 10-130seconds for the
From what i can see in the vCenter logs there is a timeout peroid that is hit causing the task to be canceled/stopped but this is not causing a failure or any issues, everything still works fine. Its just takes 40mins longer per VM for any snapshots.
Here are the the vCenter logs:
2017-04-14T23:07:39.287+10:00 error vpxd[05084] [Originator@6876 sub=MoEvent opID=4569d102] [EventManagerMoImpl::EventBatchAppender::EventAppenderJob::PostJob] VirtualCenter did not respond to the request to append events after 2400 seconds
2017-04-14T23:07:39.291+10:00 error vpxd[05084] [Originator@6876 sub=Vmomi opID=4569d102] [VpxVmomi] IO exception while sending fault
2017-04-14T23:07:39.291+10:00 info vpxd[05084] [Originator@6876 sub=vpxLro opID=4569d102] [VpxLRO] -- FINISH task-internal-27952
2017-04-14T23:07:39.291+10:00 info vpxd[05084] [Originator@6876 sub=Default opID=4569d102] [VpxLRO] -- ERROR task-internal-27952 -- EventManager -- vim.event.EventManager.logUserEvent: vmodl.fault.UnexpectedFault:
--> Result:
--> (vmodl.fault.UnexpectedFault) {
--> faultCause = (vmodl.MethodFault) null,
--> faultName = "vim.fault.Timedout",
--> fault = (vmodl.MethodFault) null,
--> msg = ""
--> }
--> Args:
-->
--> Arg entity:
--> 'vim.VirtualMachine:vm-2741'
--> Arg msg:
--> "Source: Veeam Backup
--> Action: Job "Backup"
--> Operation: Stopped
--> Status: Success
--> ID: 9444fd99-6cb7-45c0-90b7-cead7533a244
--> Description: "
2017-04-14T23:07:39.297+10:00 warning vpxd[05316] [Originator@6876 sub=VpxProfiler opID=HB-host-19@28497-1d592c26] VpxdInvtVm::ProcessVmInfo [ProcessVmInfoTime] took 1477703 ms
2017-04-14T23:07:39.297+10:00 warning vpxd[05316] [Originator@6876 sub=VpxProfiler opID=HB-host-19@28497-1d592c26] VpxdInvtVm::ProcessVmChanges [ProcessVmChangesTime] took 1477703 ms
2017-04-14T23:07:39.300+10:00 warning vpxd[05316] [Originator@6876 sub=VpxProfiler opID=HB-host-19@28497-1d592c26] [VpxdHostSync] ProcessChanges host:syd-sy1-ex02.cloud.local [ProcessChanges] took 1477706 ms
2017-04-14T23:07:39.301+10:00 warning vpxd[05316] [Originator@6876 sub=VpxProfiler opID=HB-host-19@28497-1d592c26] [VpxdHostSync] DoHostSync:0000000004A81420 [DoHostSyncTime] took 1478249 ms
2017-04-14T23:07:39.301+10:00 warning vpxd[05316] [Originator@6876 sub=VpxProfiler opID=HB-host-19@28497-1d592c26] InvtHostSyncLRO::StartWork [HostSyncTime] took 1478249 ms
2017-04-14T23:07:39.301+10:00 warning vpxd[05316] [Originator@6876 sub=VpxProfiler opID=HB-host-19@28497-1d592c26] VpxLro::LroMain [TotalTime] took 1478249 ms
Going bonkers here guys, not idea whats going on. Thanks in advance for any help
I've created a project on Freelancer also:vmware vcenter | VMware
Cheer.
UPDATE: Fixed but not solved.
Hi Everyone,
So I ended up taking a bit of a sledgehammer approach to the issue as time to troubleshoot was running thin.
Actions:
- Re-indexed vPostgreSQL
- Backup VCDB in PostgreSQL with python (https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=20919... )
- Restored vCenter from two years ago (before any trace of DB issues)
- Disabled vCenter service
- Reset VM password (https://msdn.microsoft.com/en-us/powershell/reference/5.1/microsoft.powershell.management/reset-comp... )
- Patched and updated
- Restored VCDB (as per link above)
- Started vCenter
It's been a few days and seems stable so I'm putting in the done pile but will update if it pops back up.
delay may be due to issue with your storage. Check your storage controller and logs for any hardware issue on the storage.
No, the issue is apparent across multiple storage arrays and DAS. Also it's temporary resolved following the vCenter reboot and has only been apparent in the last few weeks after 24 months without issue. If it was a storage issue i'd expect to see more symptoms like other VMs on that array with poor performance stats, pauses or failures.
There is no adverse effect aside from the elongated snapshot task aside from pushing out the backup windows. Seeing as snapshot creation is fairly light on IO, or any resources it seems like something in vCenter is hanging pr missing the bus.
Is there any changes done on the environment?
version of Vsphere and ESXI host Hardware model?
No changes.
ESXi 6.0.0, 3825889
vCenter 6.0 Update 2
Flex x240 M5
EMC XtreamIO
Dell Compellent SC9000
All firmware is current.
UPDATE: The issue is temporarily resolved 12-24Hrs by restarting vCenter (whole Windows VM) but then the problem returns.
I'm suffering through the exact same problem that has been plaguing me since rolling out Veeam 9.5 Update 1. We're also on vCenter 6 Update 2, and this was also happening while running update 1. I'm seeing very similar error messages as you.
2017-04-17T08:15:41.545-04:00 info vpxd[36060] [Originator@6876 sub=vpxLro opID=668fcc94] [VpxLRO] -- FINISH task-internal-1128758
2017-04-17T08:15:41.545-04:00 info vpxd[36060] [Originator@6876 sub=Default opID=668fcc94] [VpxLRO] -- ERROR task-internal-1128758 -- EventManager -- vim.event.EventManager.logUserEvent: vmodl.fault.UnexpectedFault:
--> Result:
--> (vmodl.fault.UnexpectedFault) {
--> faultCause = (vmodl.MethodFault) null,
--> faultName = "vim.fault.Timedout",
--> fault = (vmodl.MethodFault) null,
--> msg = ""
--> }
--> Args:
-->
--> Arg entity:
--> 'vim.VirtualMachine:vm-2015'
--> Arg msg:
--> "Source: Veeam Backup
--> Action: Job "REDACTED"
--> Operation: Stopped
--> Status: Failed
--> ID: 5f4d48d2-937a-4f0c-b020-d00977c8212f
--> Description: "
2017-04-17T08:15:41.694-04:00 info vpxd[30920] [Originator@6876 sub=vpxLro opID=opId-2def0694-2ff3-4489-a88c-ef3dc2d2f599-62-a4] [VpxLRO] -- BEGIN task-internal-1141421 -- ServiceInstance -- vim.ServiceInstance.GetServerClock -- 520eebdd-3dbe-31ea-dbb9-99f45108c628(528fa8c7-e19b-34ce-9f7f-5add55cf5e2f)
2017-04-17T08:15:41.696-04:00 info vpxd[30920] [Originator@6876 sub=vpxLro opID=opId-2def0694-2ff3-4489-a88c-ef3dc2d2f599-62-a4] [VpxLRO] -- FINISH task-internal-1141421
2017-04-17T08:15:41.702-04:00 info vpxd[40616] [Originator@6876 sub=vpxLro opID=opId-0b1361a7-b76c-4aaa-9c21-30b4058aec6b-57-16] [VpxLRO] -- BEGIN task-internal-1141422 -- ServiceInstance -- vim.ServiceInstance.GetServerClock -- 52a39e5c-8380-7745-cab2-0ec8bfdd8ccf(52fb4d07-2a3d-94b9-84f2-aab1aba4c5f0)
2017-04-17T08:15:41.704-04:00 info vpxd[40616] [Originator@6876 sub=vpxLro opID=opId-0b1361a7-b76c-4aaa-9c21-30b4058aec6b-57-16] [VpxLRO] -- FINISH task-internal-1141422
2017-04-17T08:15:41.722-04:00 info vpxd[41648] [Originator@6876 sub=vpxLro opID=opId-82e7e45d-a6c4-49df-90bd-fd98a430f8b1-68-79] [VpxLRO] -- BEGIN task-internal-1141423 -- ServiceInstance -- vim.ServiceInstance.GetServerClock -- 527f1498-f872-9c49-6ff0-34f156d67cc9(52eb3d4b-f9dc-5810-67f8-10551a25df17)
2017-04-17T08:15:41.722-04:00 info vpxd[41648] [Originator@6876 sub=vpxLro opID=opId-82e7e45d-a6c4-49df-90bd-fd98a430f8b1-68-79] [VpxLRO] -- FINISH task-internal-1141423
2017-04-17T08:15:52.176-04:00 error vpxd[40044] [Originator@6876 sub=MoEvent opID=4793cbf4] [EventManagerMoImpl::EventBatchAppender::EventAppenderJob::PostJob] VirtualCenter did not respond to the request to append events after 2400 seconds
2017-04-17T08:15:52.193-04:00 error vpxd[40044] [Originator@6876 sub=Vmomi opID=4793cbf4] [VpxVmomi] IO exception while sending fault
2017-04-17T08:15:52.193-04:00 info vpxd[40044] [Originator@6876 sub=vpxLro opID=4793cbf4] [VpxLRO] -- FINISH task-internal-1128855
2017-04-17T08:15:52.193-04:00 info vpxd[40044] [Originator@6876 sub=Default opID=4793cbf4] [VpxLRO] -- ERROR task-internal-1128855 -- EventManager -- vim.event.EventManager.logUserEvent: vmodl.fault.UnexpectedFault:
--> Result:
--> (vmodl.fault.UnexpectedFault) {
--> faultCause = (vmodl.MethodFault) null,
--> faultName = "vim.fault.Timedout",
--> fault = (vmodl.MethodFault) null,
--> msg = ""
--> }
--> Args:
-->
--> Arg entity:
--> 'vim.VirtualMachine:vm-1620'
--> Arg msg:
--> "Source: Veeam Backup
--> Action: Job "REDACTED"
--> Operation: Stopped
--> Status: Success
--> ID: f03e5766-4385-4c11-92f3-a1f766849829
--> Description: "
The only errors I see before this are regarding the Web server.
2017-04-17T07:51:25.553-04:00 error vpxd[37000] [Originator@6876 sub=HttpSvc.HTTPService] Failed to read request; stream: <io_obj p:0x000000002e11caa8, h:-1, <pipe '\\.\pipe\vmware-vpxd-webserver-pipe'>, <pipe '\\.\pipe\vmware-vpxd-webserver-pipe'>>, error: class Vmacore::TimeoutException(Operation timed out)
2017-04-17T07:51:25.554-04:00 error vpxd[38936] [Originator@6876 sub=HttpSvc.HTTPService] Failed to read request; stream: <io_obj p:0x00000000176be8f8, h:-1, <pipe '\\.\pipe\vmware-vpxd-webserver-pipe'>, <pipe '\\.\pipe\vmware-vpxd-webserver-pipe'>>, error: class Vmacore::TimeoutException(Operation timed out)
The next errors after these messages involve failures with Veeam job information that match your output. Do you happen to also see these HTTP pipe failures in your logs?
HTTP pipe errors in Veeam or vCenter logs?have the same problem with manual snapshots through vCenter
I don't think it's a Veeam issue as I have the same problem when creating snapshots from vCenter directly.
Those pipe errors are from the VPXD logs. Veeam is showing errors in it's log saying that it's failing to disconnect it's session due to a forbidden message from vCenter. I believe this is causing stale sessions to build which may be causing the vcenterServer service to crash. I'll post more here as we investigate.
Hi,
I am facing the same issue with snapshot delays and after VC reboot it got fixed for a while but popped up again. If you found any update , please let me know.
Thanks in advance.
did you check vmware.log file ?
Yes, i checked and found same type of logs mentioned in above.
okie,
Did you try to take snapshot individually by login to standalone ESXI? How did it go?
could you also check the resource utilization of vCenter?
Manual snapshot is working fine. But in VPXD.log seeing below timeout errors a lot
2017-06-14T14:45:00.605-05:00 error vpxd[06776] [Originator@6876 sub=SSL SoapAdapter.HTTPService] Failed to read request; stream: <SSL(<io_obj p:0x0000000043a45b68, h:-1, <TCP '0.0.0.0:0'>, <TCP '0.0.0.0:0'>>)>, error: class Vmacore::TimeoutException(Operation timed out)
Failed to accept connection; <acceptor p:0x0000000011ab65e0, h:5380, <TCP '127.0.0.1:8085'>>, e: system:995(The I/O operation has been aborted because of either a thread exit or an application request)
Failed to connect socket; <io_obj p:0x0000000104eb1e38, h:1056, <TCP '0.0.0.0:0'>, <TCP '127.0.0.1:8088'>>, e: system:10061(No connection could be made because the target machine actively refused it)
Snapshot from the ESXi host directly takes about 43ms
Utilisation on vCenter
- CPU 5% (6 Cores)
- RAM 60% (12GB)
- Disk <10ms latency
I found these errors in vPostgres but not sure if related.
2017-06-19 00:07:01.792 AEST 594688f4.59cc 0 ERROR: left link changed unexpectedly in block 75 of index "vpxi_task_complete_time"
2017-06-19 00:07:01.792 AEST 594688f4.59cc 0 CONTEXT: automatic vacuum of table "VCDB.vc.vpx_task"
2017-06-20T08:57:38.094+10:00 info vpxd[05472] [Originator@6876 sub=vpxLro opID=6e73ffa9] [VpxLRO] -- FINISH task-internal-1952981
2017-06-20T08:57:38.097+10:00 info vpxd[04964] [Originator@6876 sub=vpxLro opID=18ed54da] [VpxLRO] -- BEGIN task-internal-1952982 -- host-19 -- vim.HostSystem.GetConfig -- 52881301-3cac-4455-93c3-7dbde1b1914a(52b2533d-00e0-ef7c-3cca-2808258c3075)
2017-06-20T08:57:38.181+10:00 info vpxd[04964] [Originator@6876 sub=vpxLro opID=18ed54da] [VpxLRO] -- FINISH task-internal-1952982
2017-06-20T08:57:38.266+10:00 error vpxd[17356] [Originator@6876 sub=SoapAdapter.HTTPService] Failed to read request; stream: <io_obj p:0x000000000b06fa58, h:4936, <TCP '127.0.0.1:8085'>, <TCP '127.0.0.1:59501'>>, error: class Vmacore::SystemException(An existing connection was forcibly closed by the remote host)
2017-06-20T08:57:38.850+10:00 info vpxd[00960] [Originator@6876 sub=vpxLro opID=4597126d] [VpxLRO] -- BEGIN task-internal-1952984 -- session[527c9b0a-5c45-a878-3047-2391bdfb72c5]52e2df17-65d7-b429-ba54-6cb3c9090e68 -- vmodl.query.PropertyCollector.Filter.destroy -- 527c9b0a-5c45-a878-3047-2391bdfb72c5(520ab8a4-fffc-1087-c2e2-c8528b600451)
2017-06-20T08:57:38.850+10:00 info vpxd[00960] [Originator@6876 sub=vpxLro opID=4597126d] [VpxLRO] -- FINISH task-internal-1952984
2017-06-20T08:57:38.911+10:00 info vpxd[07412] [Originator@6876 sub=vpxLro opID=6cb36ffe] [VpxLRO] -- BEGIN task-internal-1952986 -- SearchIndex -- vim.SearchIndex.findByDatastorePath -- 527c9b0a-5c45-a878-3047-2391bdfb72c5(520ab8a4-fffc-1087-c2e2-c8528b600451)
2017-06-20T08:57:38.913+10:00 info vpxd[07412] [Originator@6876 sub=vpxLro opID=6cb36ffe] [VpxLRO] -- FINISH task-internal-1952986
2017-06-20T08:57:40.035+10:00 info vpxd[06332] [Originator@6876 sub=vpxLro opID=394edd24] [VpxLRO] -- BEGIN task-internal-1952988 -- session[527c9b0a-5c45-a878-3047-2391bdfb72c5]523ea4a8-848b-47ed-ccc0-6cdab202b110 -- vmodl.query.PropertyCollector.Filter.destroy -- 527c9b0a-5c45-a878-3047-2391bdfb72c5(520ab8a4-fffc-1087-c2e2-c8528b600451)
2017-06-20T08:57:40.035+10:00 info vpxd[06332] [Originator@6876 sub=vpxLro opID=394edd24] [VpxLRO] -- FINISH task-internal-1952988
2017-06-20T08:57:40.085+10:00 info vpxd[09148] [Originator@6876 sub=vpxLro opID=360ac313] [VpxLRO] -- BEGIN task-internal-1952990 -- SearchIndex -- vim.SearchIndex.findByDatastorePath -- 527c9b0a-5c45-a878-3047-2391bdfb72c5(520ab8a4-fffc-1087-c2e2-c8528b600451)
2017-06-20T08:57:40.087+10:00 info vpxd[09148] [Originator@6876 sub=vpxLro opID=360ac313] [VpxLRO] -- FINISH task-internal-1952990
2017-06-20T08:57:41.334+10:00 error vpxd[05364] [Originator@6876 sub=MoEvent opID=b769632] [EventManagerMoImpl::EventBatchAppender::EventAppenderJob::PostJob] VirtualCenter did not respond to the request to append events after 2400 seconds (AKA 40mins)
2017-06-20T08:57:41.338+10:00 error vpxd[05364] [Originator@6876 sub=Vmomi opID=b769632] [VpxVmomi] IO exception while sending fault
2017-06-20T08:57:41.338+10:00 info vpxd[05364] [Originator@6876 sub=vpxLro opID=b769632] [VpxLRO] -- FINISH task-internal-1947434
2017-06-20T08:57:41.338+10:00 info vpxd[05364] [Originator@6876 sub=Default opID=b769632] [VpxLRO] -- ERROR task-internal-1947434 -- EventManager -- vim.event.EventManager.postEvent: vmodl.fault.UnexpectedFault:
--> Result:
--> (vmodl.fault.UnexpectedFault) {
--> faultCause = (vmodl.MethodFault) null,
--> faultName = "vim.fault.Timedout",
--> fault = (vmodl.MethodFault) null,
--> msg = ""
--> }
--> Args:
-->
--> Arg eventToPost:
--> (vim.event.EventEx) {
--> key = 6201771,
--> chainId = 6201771,
--> createdTime = "2017-06-19T22:17:41.334532Z",
--> userName = "username",
--> datacenter = (vim.event.DatacenterEventArgument) {
--> name = "datacenter",
--> datacenter = 'vim.Datacenter:f52e-512d-4251-a59c-6538f60cd274:datacente'
--> },
--> computeResource = (vim.event.ComputeResourceEventArgument) null,
--> host = (vim.event.HostEventArgument) null,
--> vm = (vim.event.VmEventArgument) null,
--> ds = (vim.event.DatastoreEventArgument) null,
--> net = (vim.event.NetworkEventArgument) null,
--> dvs = (vim.event.DvsEventArgument) null,
--> fullFormattedMessage = <unset>,
--> changeTag = <unset>,
--> eventTypeId = "com.vmware.vshield.edge.gateway.HealthcheckNoPulse",
--> severity = "info",
--> message = <unset>,
--> objectId = "datacenter-2",
--> objectType = "vim.Datacenter",
--> objectName = "datacenter",
--> fault = (vmodl.MethodFault) null,
--> }
--> Arg taskInfo:
-->
2017-06-20T08:57:42.777+10:00 info vpxd[12248] [Originator@6876 sub=vpxLro opID=PHS-4bbf1bbc-ec73-4420-8473-179617980566-1a-de-e7] [VpxLRO] -- BEGIN task-internal-1952993 -- domain-c25 -- vim.ComputeResource.GetHost -- 521c16b8-9819-78c4-2265-88bbd0a043c0(524a732b-ac43-3d0e-1477-2bdfdc7699b4)
2017-06-20T08:57:42.778+10:00 info vpxd[12248] [Originator@6876 sub=vpxLro opID=PHS-4bbf1bbc-ec73-4420-8473-179617980566-1a-de-e7] [VpxLRO] -- FINISH task-internal-1952993
vPostgress can't be the issue here as u mentioned that issue is the same while directly login to ESXI host and taking snapshot took same time
so the issue can be either the patch level of ESXi, Storage or if the VM's are real too busy to allow snapshot
do you have any local disk on ESXI to test the snapshot location change ?
Snapshot:
vCenter = 40mins
ESXi = <40ms
So on ESXi directly it's almost instant.
ohkie i see now . could you create a dummy VM without configuring in veeam and take snapshot to check how it goes?
Just created a new blank VM with a 1GB Thin disk and its the same issue