harderock
Contributor

VM migration fails on 6.5U2

Hello,

We encountered a vmx crash when migrating a VM from one host to another. Both the source and destination hosts are running ESXi 6.5U2. The backend storage of the VM is an NFS datastore, working with the certified NAS-VAAI plugin: VMware Compatibility Guide - Storage/SAN Search

The vmware.log shows that an invalid address is accessed when opening a database connection (the VM has a .db file) during vMotion:

2020-04-18T07:02:01.303Z| vmx| I125: NamespaceMgrCheckpoint: opening database connection (Huayun_OU01.db).
2020-04-18T07:02:01Z[+0.052]| vmx| W115: Caught signal 11 -- tid 124426 (addr 28)
2020-04-18T07:02:01Z[+0.052]| vmx| I125: SIGNAL: rip 0xb47d8d7499 rsp 0x35978196518 rbp 0x359781965f0
2020-04-18T07:02:01Z[+0.052]| vmx| I125: SIGNAL: rax 0x28 rbx 0xb43c001900 rcx 0x0 rdx 0x8 rsi 0x3597819657c rdi 0x28
2020-04-18T07:02:01Z[+0.052]| vmx| I125:         r8 0x0 r9 0x7 r10 0xfffffffffffff547 r11 0xb47d8d60d7 r12 0x35978196530 r13 0xb43b39bb7b r14 0x35978196532 r15 0x0
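
For context on how we read that: a fault address of 28, together with rax/rdi = 0x28, looks like the classic signature of dereferencing a field at a small offset through a NULL structure pointer, e.g. a database/connection handle that was never set up. A minimal C sketch of that pattern (the structure and field names below are purely hypothetical, not VMware code):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical layout, only meant to place a pointer field at offset 0x28. */
typedef struct NamespaceDbHandle {
    char  name[0x28];   /* 40 bytes, so the next field lands at offset 0x28 */
    void *conn;         /* read when the database connection is opened      */
} NamespaceDbHandle;

static void open_connection(const NamespaceDbHandle *db)
{
    /* If db is NULL, reading db->conn touches address 0x28 -> SIGSEGV. */
    printf("conn = %p\n", db->conn);
}

int main(void)
{
    printf("offsetof(conn) = 0x%zx\n", offsetof(NamespaceDbHandle, conn));
    open_connection(NULL);   /* faults with an access to address 0x28 */
    return 0;
}

That would be consistent with some handle still being NULL at the moment NamespaceMgrCheckpoint tries to open the database connection.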

Some more tests show that:

1. Only a few VMs have this problem. For these VMs, Storage vMotion from the NFS datastore to another datastore also fails, with the same error in the log (access to invalid address 28).

2. If the VM is rebooted before migrating, the migration can succeed.

Attached are the vmware.log of the failed VM and the zdump generated when the vmx crashed.

Any comments are welcome.

harderock
Contributor

Could someone take a look at this issue?

nachogonzalez
Commander

Hi, some questions:

- Is the NFS share visible to all ESXi hosts?
- Is this a single DB instance or is this a clustered VM?
- Does it have RDMs?
- If you try to do a host vMotion, does it complete successfully?

Warm regards

harderock
Contributor

Thanks for following up.

> Is the NFS share visible to all ESXi hosts?

Yes

> Is this a single DB instance or is this a clustered VM?

This is not a clustered VM; it's a regular VM that doesn't even run a DB application, but there is a .db file in the VM's home directory on the NFS datastore.

> Does it have RDM's?

No

> If you try to do a host vMotion, does it complete successfully?

No, the vmware.log attached shows that host vMotion failed.

Let me know if you need any more info.

Regards

dariusd
VMware Employee

You're encountering the problem described by this KB article: VMware Knowledge Base: "There is no VM_NAME process running for config file" error when deploying a ...

I believe a fix for this issue was delivered in 6.5 U3.  There is ongoing discussion as to whether or not the fix will address the issue under all conditions, but please do consider upgrading if you have the opportunity.

Thanks,

--

Darius

harderock
Contributor

Thanks for the reply, Darius.

The symptoms described in the KB look different from ours, but the root cause seems related to our case:

This issue occurs due to a race condition between Namespace DB creation and migration activity.
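
If I understand that cause correctly, the pattern would be something like the check-then-use race sketched below (purely illustrative, with hypothetical names, not VMware's actual code): the migration/checkpoint path reads the namespace DB handle before the creation path has published it, so it can observe a NULL handle.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical per-VM state; names are illustrative only. */
static void *g_namespace_db = NULL;

/* Creation path: builds the namespace DB and publishes the handle. */
static void *namespace_db_create(void *arg)
{
    (void)arg;
    usleep(1000);                        /* creation finishes slightly later */
    g_namespace_db = calloc(1, 64);      /* handle becomes valid only here   */
    return NULL;
}

/* Migration/checkpoint path: uses the handle with no synchronization. */
static void *migration_checkpoint(void *arg)
{
    (void)arg;
    /* If this runs before creation completes, it sees a NULL handle and any
       field access through it would fault at a small address (e.g. 0x28). */
    printf("opening database connection via handle %p\n", g_namespace_db);
    return NULL;
}

int main(void)
{
    pthread_t creator, migrator;
    pthread_create(&creator, NULL, namespace_db_create, NULL);
    pthread_create(&migrator, NULL, migration_checkpoint, NULL);
    pthread_join(creator, NULL);
    pthread_join(migrator, NULL);
    free(g_namespace_db);
    return 0;
}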

> I believe a fix for this issue was delivered in 6.5 U3.

We did one more test: migrating the VM from a 6.5U2 host to a 6.5U3 host still failed. Is this failure expected given this known issue?

To migrate successfully, do we need to guarantee that both the source and destination hosts are 6.5U3? I ask because my understanding is that the race condition only occurs on the destination host, so as long as the destination host is 6.5U3 the migration should be OK.

Thanks,

Ping

dariusd
VMware Employee

Did the destination VM on 6.5 U3 crash in the same way, or was the error message different?  If different, can you attach a vmware.log?

It looks like the change is on the destination side, so a 6.5 U2 to U3 migration should have used the improved code. That said, there will likely be some cases where the change still does not prevent a migration failure, though it should hopefully improve the error reporting.

Thanks,

--

Darius

harderock
Contributor

Sorry for the delayed reply, Darius.

> Did the destination VM on 6.5 U3 crash in the same way, or was the error message different? If different, can you attach a vmware.log?

We did one more test today; both the error message and the backtrace look the same.

2020-05-07T15:23:09.095+08:00| vmx| I125: NamespaceMgrCheckpoint: opening database connection (Huayun_OU01.db).
2020-05-07T15:23:09+08:00[+0.027]| vmx| W115: Caught signal 11 -- tid 368406938 (addr 28)
2020-05-07T15:23:09+08:00[+0.027]| vmx| I125: SIGNAL: rip 0x9547e93499 rsp 0x35e78b3f518 rbp 0x35e78b3f5f0
2020-05-07T15:23:09+08:00[+0.027]| vmx| I125: SIGNAL: rax 0x28 rbx 0x95064aa910 rcx 0x0 rdx 0x8 rsi 0x35e78b3f57c rdi 0x28
2020-05-07T15:23:09+08:00[+0.027]| vmx| I125:         r8 0x0 r9 0x7 r10 0xfffffffffffff547 r11 0x9547e920d7 r12 0x35e78b3f530 r13 0x950594d9eb r14 0x35e78b3f532 r15 0x0

> It looks like the change is on the destination side, so a 6.5 U2 to U3 migration should have used the improved code. That said, there will likely be some cases where the change still does not prevent a migration failure, though it should hopefully improve the error reporting.

Attached are the vmware.log and the vmx zdump file.

The source host (IP: 11.11.11.226) is 6.5U2, while the destination host (IP: 11.11.11.191) is 6.5U3.

Please help take a look.

Thanks,

Ping

harderock
Contributor

+ vmware.log & vmx zdump

harderock
Contributor

dariusd

Hi Darius,

Could you please take a look at the log and vmx zdump from the new test described in comments #7 and #8?

Thanks,

Ping

dariusd
VMware Employee

Apologies for the delay... I did take a look at this when you posted it, but I must have forgotten to post back here.

It's certainly the same point of failure, but I don't understand why it is still happening with 6.5 U3.  I do not have any good answers for you, I'm afraid.

--

Darius

harderock
Contributor

Thanks Darius.

You mentioned that there is ongoing internal discussion about whether the fix addresses the issue under all conditions; it looks to me like this case provides a counter-example. Hopefully it can help the VMware R&D team fix this issue further.

Could you please post here if there is an update on this issue in the future?

Appreciate your support!

Ping
