VMware {code} Community
Fifty8
Contributor

VDDK 6.7 EP1 - Crash on VixDiskLib_Close

I hit this crash intermittently until I found the following scenario, which reproduces it every time.

Setup:
1. Have one VM, called VM1, that has 2 SCSI disks: 10+ GB for the guest OS and 1 GB for data.
2. Clone this VM. The clone is called VM2.

Steps to reproduce:
1. Back up VM1 and VM2 using a backup app.
2. Restore VM1 and VM2, at the same time, to the same vCenter, in the same VM folder, on the same ESX host and on the same datastore.
The transport modes requested for writing data to the VM disks are nbdssl:nbd. The vCenter version can be anything from 5.5 to 6.7.

What happens: both VMs are only partially restored. On each VM, only the 1 GB disk is completely restored. The application crashes as soon as the 1 GB disk of the second VM (VM2) is closed.

****

Notes regarding the implementation used on restore:
1. The 2 VMs are created on the destination vCenter on the Main thread.
2. The VM disks are handled concurrently, one thread per disk. In this case there are 4 threads, T1, ..., T4, writing to the 4 disks. The Main thread waits for these threads to finish.
3. The disks are opened and closed on a dedicated thread, as instructed in the VDDK documentation.
4. The writes are synchronous (using VixDiskLib_Write).
5. The connection parameters are allocated using VixDiskLib_AllocateConnectParams.
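
To make notes 2 and 3 concrete, here is a minimal sketch of the threading layout, not the actual application code. It assumes POSIX threads. All names in it (OpenCloseQueue, submit, restore_disks, and so on) are hypothetical, and the open and close steps are stubs that only adjust a counter where a real program would call VixDiskLib_Open and VixDiskLib_Close. Each disk thread hands its open and close requests to one dedicated thread and blocks until the request completes.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical request kinds; none of these names come from VDDK. */
typedef enum { REQ_OPEN, REQ_CLOSE, REQ_STOP } ReqKind;

typedef struct Request {
    ReqKind kind;
    int disk_id;            /* stand-in for a disk path or handle */
    int done;               /* set by the open/close thread */
    int result;             /* 0 on success */
    struct Request *next;
} Request;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t has_work;    /* signalled when a request is queued */
    pthread_cond_t completed;   /* broadcast when a request finishes */
    Request *head, *tail;
    int open_count;             /* touched only by the open/close thread */
    pthread_t thread;
} OpenCloseQueue;

/* The dedicated thread: every open and close runs here, one at a time. */
static void *open_close_thread(void *arg)
{
    OpenCloseQueue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL)
            pthread_cond_wait(&q->has_work, &q->lock);
        Request *r = q->head;
        q->head = r->next;
        if (q->head == NULL)
            q->tail = NULL;
        pthread_mutex_unlock(&q->lock);

        int stop = (r->kind == REQ_STOP);
        /* Stubs: a real program would call VixDiskLib_Open/Close here. */
        if (r->kind == REQ_OPEN)
            q->open_count++;
        else if (r->kind == REQ_CLOSE)
            q->open_count--;

        pthread_mutex_lock(&q->lock);
        r->result = 0;
        r->done = 1;            /* r belongs to the submitter from here on */
        pthread_cond_broadcast(&q->completed);
        pthread_mutex_unlock(&q->lock);
        if (stop)
            return NULL;
    }
}

/* Queue a request and block until the dedicated thread has handled it. */
static int submit(OpenCloseQueue *q, ReqKind kind, int disk_id)
{
    Request r = { kind, disk_id, 0, -1, NULL };
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = &r; else q->head = &r;
    q->tail = &r;
    pthread_cond_signal(&q->has_work);
    while (!r.done)
        pthread_cond_wait(&q->completed, &q->lock);
    pthread_mutex_unlock(&q->lock);
    return r.result;
}

typedef struct { OpenCloseQueue *q; int disk_id; int ok; } Worker;

/* One thread per disk: wait for open, write, wait for close. */
static void *disk_worker(void *arg)
{
    Worker *w = arg;
    int rc_open = submit(w->q, REQ_OPEN, w->disk_id);
    /* Synchronous writes (VixDiskLib_Write in a real program) go here. */
    int rc_close = submit(w->q, REQ_CLOSE, w->disk_id);
    w->ok = (rc_open == 0 && rc_close == 0);
    return NULL;
}

/* Returns the number of disks left open (0 expected), or -1 on failure.
   Error handling for the pthread calls is omitted to keep the sketch short. */
int restore_disks(int n_disks)
{
    enum { MAX_DISKS = 16 };
    assert(n_disks >= 0 && n_disks <= MAX_DISKS);

    OpenCloseQueue q = { .head = NULL, .tail = NULL, .open_count = 0 };
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.has_work, NULL);
    pthread_cond_init(&q.completed, NULL);
    pthread_create(&q.thread, NULL, open_close_thread, &q);

    Worker workers[MAX_DISKS];
    pthread_t threads[MAX_DISKS];
    for (int i = 0; i < n_disks; i++) {
        workers[i] = (Worker){ &q, i, 0 };
        pthread_create(&threads[i], NULL, disk_worker, &workers[i]);
    }
    int failed = 0;
    for (int i = 0; i < n_disks; i++) {
        pthread_join(threads[i], NULL);
        if (!workers[i].ok)
            failed = 1;
    }

    submit(&q, REQ_STOP, -1);       /* ask the dedicated thread to exit */
    pthread_join(q.thread, NULL);
    pthread_cond_destroy(&q.completed);
    pthread_cond_destroy(&q.has_work);
    pthread_mutex_destroy(&q.lock);
    return failed ? -1 : q.open_count;
}
```

Calling restore_disks(4) mirrors the 4-disk case described here: T1 to T4 each block on the queue for their open and close, so no two open or close calls ever overlap.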

The timeline of VDDK operations per thread:
1. Main: VM1 and VM2 are created
2. T1: open connection C1 to VM1, wait for OS disk to open, start writing data on disk
3. T2: wait for VM1 data disk to open on C1, start writing data on disk
4. T3: open connection C2 to VM2, wait for OS disk to open, start writing data on disk
5. T4: wait for VM2 data disk to open on C2, start writing data on disk
6. T2: finish writing to VM1 data disk, wait for data disk to close
7. T4: finish writing to VM2 data disk, wait for data disk to close
The crash occurs when closing the VM2 data disk.

I also noticed something about the transport modes. In my earlier tests, before I found this scenario and when the app did not crash, all disks were opened using nbdssl; that is what VixDiskLib_GetTransportMode returned for every disk. In this scenario, where the app crashes, the OS disks are opened using nbdssl and the data disks are opened using nbd. I suspect this is timing-related and that this scenario forces the VDDK operations into a sequence that triggers the issue.
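
The check behind this observation can be sketched as a small diagnostic helper. This is only an illustration: the function names are hypothetical, the mode strings are hard-coded where a real program would take them from VixDiskLib_GetTransportMode, and it assumes the first entry of the colon-separated list is the preferred mode.

```c
#include <assert.h>
#include <string.h>

/* Copies the first entry of a colon-separated mode list into out. */
static void first_mode(const char *requested, char *out, size_t out_size)
{
    assert(out_size > 0);
    size_t n = strcspn(requested, ":");
    if (n >= out_size)
        n = out_size - 1;
    memcpy(out, requested, n);
    out[n] = '\0';
}

/* Returns 1 if mode is a whole entry of the colon-separated list, else 0.
   Comparing lengths keeps "nbd" from matching the prefix of "nbdssl". */
int mode_is_requested(const char *requested, const char *mode)
{
    size_t len = strlen(mode);
    const char *p = requested;
    while (*p) {
        size_t n = strcspn(p, ":");
        if (n == len && strncmp(p, mode, n) == 0)
            return 1;
        p += n;
        if (*p == ':')
            p++;
    }
    return 0;
}

/* Counts disks whose reported mode is not the first requested mode.
   Assumption: the first list entry is the preferred mode. */
int count_fallbacks(const char *requested,
                    const char *const actual[], int n_disks)
{
    char preferred[32];
    first_mode(requested, preferred, sizeof preferred);
    int fallbacks = 0;
    for (int i = 0; i < n_disks; i++) {
        if (strcmp(actual[i], preferred) != 0)
            fallbacks++;
    }
    return fallbacks;
}
```

With "nbdssl:nbd" requested and the per-disk modes seen in the crashing scenario (nbdssl, nbd, nbdssl, nbd), count_fallbacks reports 2 disks on the fallback mode. In the earlier non-crashing tests, where every disk reported nbdssl, it would report 0.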

***

Questions:

1. Given that the transport modes requested when connecting to the VMs are always nbdssl:nbd, is it normal for the OS disk to be opened using nbdssl and the data disk using nbd?
2. Is there a new multithreading requirement in VDDK 6.7 that I need to follow to avoid this crash?
