VMware Cloud Community
oey
Contributor

VMDK and VMX missing after RAID-controller failure

An ESXi 6.0.0 host stopped with a purple screen of death (PSOD) and the message:

"LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor."

Dell support replaced the motherboard and PERC card.

After booting ESXi, two VMs are inaccessible. Browsing the datastore shows that the VMDK is only 0.5 KB and the VMX.LCK is 0.00 KB. The flat VMDK is missing.

vmfs-fuse shows the same files as the datastore browser.

I tried "DiskInternals VMFS Recovery", and after hours of scanning the disk for lost files it shows the flat VMDK as a "deleted file" with the correct file size. It also found a VMX file of 4.4 KB, which seems correct.

 

Does anyone have experience with this software? I do not have a license, and the software can't recover files without one.

Are there any other options? Is it possible to recover these files without costly software?

6 Replies
vmrale
Expert

Hi,

Please give me more information about the RAID array configuration, for example the RAID level (0, 1, 5, etc.). Is the VMFS datastore configured on the same disk where ESXi is installed? Are there any other datastores configured? Is the host added to the vCenter Server inventory?

It seems that the VMX and VMDK descriptor files exist and are accessible, because the VMs are present in the host inventory, but for some reason the flat VMDK file is missing. Probably the VM folder is located on one datastore and some files (the flat VMDK) were on another. If a VM is marked as "Inaccessible", it's a storage problem: the host is unable to access the datastore holding the flat file.
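One way to confirm which flat files a descriptor expects is to read its extent lines. The following is a minimal sketch, not a VMware tool: it assumes the descriptor is a small text file whose extent lines have the usual form `RW <sectors> VMFS "<name>-flat.vmdk"` with 512-byte sectors, and that the datastore is readable from a machine with Python (for example through a vmfs-fuse mount). The function name `check_descriptor` is hypothetical.

```python
import re
import sys
from pathlib import Path

# Assumed extent line layout: RW 83886080 VMFS "name-flat.vmdk"
EXTENT_RE = re.compile(r'^(RW|RDONLY|NOACCESS)\s+(\d+)\s+(\w+)\s+"([^"]+)"')
SECTOR_BYTES = 512  # extent sizes in a descriptor are counted in 512-byte sectors


def check_descriptor(descriptor_path):
    """Return one dict per extent: its name, expected size, and whether it exists."""
    descriptor = Path(descriptor_path)
    results = []
    for line in descriptor.read_text(errors="replace").splitlines():
        match = EXTENT_RE.match(line.strip())
        if not match:
            continue  # comments, CID lines, ddb.* entries and so on
        _access, sectors, _kind, filename = match.groups()
        # Extent names are relative to the folder that holds the descriptor.
        extent = descriptor.parent / filename
        results.append({
            "extent": filename,
            "expected_bytes": int(sectors) * SECTOR_BYTES,
            "exists": extent.is_file(),
        })
    return results


if __name__ == "__main__":
    # Pass one or more descriptor paths on the command line.
    for path in sys.argv[1:]:
        for item in check_descriptor(path):
            print(path, item)
```

The `expected_bytes` value can be compared with the size a recovery tool reports for the deleted flat file, as a rough plausibility check.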

 

Regards
Radek

If you think your question has been answered correctly, please consider marking it as a solution or rewarding me with kudos.
oey
Contributor

Thank you for your reply!

ESXi is installed on dual SD-cards, separate from the RAID controller.

I have two datastores, both are local. Host is not added to vCenter.

Datastore 1: "System", 558.38GB, VMFS5, Two physical disks, RAID1

Datastore 2: "Data1", 3.64TB, VMFS5, Two physical disks, RAID1

 

I see files missing from both datastores.

vmrale
Expert

Why did you name the datastores System and Data? Is it because you planned to store the OS VMDKs on the System datastore and application data on the other one? I suspect that some VM files were spread across these two datastores, and that after the RAID controller was replaced one of the RAID disk groups may have been initialized, wiping the old datastore.

These are just my suspicions about what went wrong.

Did you check whether some VMs are running but, when you log in to the guest OS, no data disks are attached?

Regards
Radek

oey
Contributor

You are correct about the datastore naming: system files are separated from file shares and databases.

One of the affected VMs used both datastores, and its VMDK file is missing from both.

To clarify: the files were already missing after the PSOD. Replacing the RAID controller did not change which files are present.

vmrale
Expert

Take a look at these articles first.

https://kb.vmware.com/s/article/2036767

https://docs.vmware.com/en/VMware-vSphere/6.0/com.vmware.vsphere.storage.doc/GUID-6F991DB5-9AF0-4F9F...

https://docs.vmware.com/en/VMware-vSphere/6.0/com.vmware.vsphere.storage.doc/GUID-4460A049-11BF-4924...

If the checks report nothing other than zero errors, that will be great. Also try looking for the missing files on the existing datastores. The tools can fix some errors too, but you should back up your data first.

If files such as the VMX or the VMDK descriptor are broken, there are procedures to rebuild them.

https://kb.vmware.com/s/article/1002294

https://kb.vmware.com/s/article/1002511

If everything I mentioned fails, you will have to restore the affected VMs from a backup.

 

Regards
Radek

oey
Contributor

Running voma with -f check and -f fix reported a lot of errors.
The "fix" run took care of a few of them, but the only errors that were fixed were of this kind:

Found stale lock [type 10c00001 offset 78708736 v 63, hb offset 3219456
gen 89, mode 1, owner 5fb667e6-bc0f4452-dba3-141877464cbc mtime 1028
num 0 gblnum 0 gblgen 0 gblbrk 0]

After running "fix" and then "check" again, the log looks like this.
All the missing files are listed in Phase 4 as "Invalid direntry".

System:

Checking if device is actively used by other hosts
Running VMFS Checker version 1.2 in default mode
Initializing LVM metadata, Basic Checks will be done
Phase 1: Checking VMFS header and resource files
Detected VMFS file system (labeled:'System') with UUID:5671255b-3b21621a-0c7c-141877464cbc, Version 5:61
ON-DISK ERROR: Cluster number 0 should be 3570
ON-DISK ERROR: Cluster 3570 total resources 0 should be 16
ON-DISK ERROR: Cluster number 0 should be 17
ON-DISK ERROR: Cluster 17 total resources 0 should be 200
ON-DISK ERROR: Cluster 17 free count 0 should be 7
ON-DISK ERROR: Cluster number 0 should be 18
... a lot of lines like this ...

Phase 2: Checking VMFS heartbeat region
Phase 3: Checking all file descriptors.
ON-DISK ERROR: <FD c130 r26> corrupted address <INVALID c0 r0>
ON-DISK ERROR: <FD c130 r26> invalid generation 0
ON-DISK ERROR: <FD c130 r26> : invalid FD type 0x0
ON-DISK ERROR: <FD c130 r27> corrupted address <INVALID c0 r0>
ON-DISK ERROR: <FD c130 r27> invalid generation 0
ON-DISK ERROR: <FD c130 r27> : invalid FD type 0x0
ON-DISK ERROR: <FD c130 r34> corrupted address <INVALID c0 r0>
...
Phase 4: Checking pathname and connectivity.
ON-DISK ERROR: Invalid direntry <MissingVM01.vmx, 130, 26>
ON-DISK ERROR: Invalid direntry <MissingVM01-flat.vmdk, 130, 27>
ON-DISK ERROR: Invalid direntry <MissingVM01.nvram, 130, 41>
ON-DISK ERROR: Invalid direntry <MissingVM01.vmx~, 130, 70>
ON-DISK ERROR: Invalid direntry <vmware.log, 130, 71>
ON-DISK ERROR: Invalid direntry <MissingVM02.vmx, 130, 63>
ON-DISK ERROR: Invalid direntry <MissingVM02-flat.vmdk, 130, 64>
ON-DISK ERROR: Invalid direntry <MissingVM02.nvram, 130, 73>
ON-DISK ERROR: Invalid direntry <MissingVM02-5e6625e3.vswp, 130, 39>
ON-DISK ERROR: Invalid direntry <MissingVM02.vmx~, 130, 68>
ON-DISK ERROR: Invalid direntry <vmx-MissingVM02-1583752675-1.vswp, 130, 34>
ON-DISK ERROR: Invalid direntry <vmware.log, 130, 69>
Phase 5: Checking resource reference counts.
ON-DISK ERROR: PB inconsistency found: (3570,0) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (3570,1) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (3570,2) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (3570,3) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (3570,4) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (3570,5) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (3570,6) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (3570,7) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (3570,8) allocated in bitmap, but never used
... a lot of lines like this ...

Total Errors Found: 172804


Data1

Checking if device is actively used by other hosts
Running VMFS Checker version 1.2 in check mode
Initializing LVM metadata, Basic Checks will be done
Phase 1: Checking VMFS header and resource files
Detected VMFS file system (labeled:'Data1') with UUID:567113cd-d5fef9d4-2934-141877464cbc, Version 5:61
Phase 2: Checking VMFS heartbeat region
Phase 3: Checking all file descriptors.
ON-DISK ERROR: <FD c130 r5> corrupted address <INVALID c0 r0>
ON-DISK ERROR: <FD c130 r5> invalid generation 0
ON-DISK ERROR: <FD c130 r5> : invalid FD type 0x0
Phase 4: Checking pathname and connectivity.
ON-DISK ERROR: Invalid direntry <MissingVM01_2-flat.vmdk, 130, 5>
Phase 5: Checking resource reference counts.
ON-DISK ERROR: PB inconsistency found: (3730,10) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (3730,11) allocated in bitmap, but never used
... a lot of lines like this ...

Total Errors Found: 512506

(I have renamed VMs to MissingVM01 and MissingVM02)
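With hundreds of thousands of lines, a saved voma log is easier to read once it is summarized. The sketch below is a hypothetical helper, not part of voma: it assumes the log was saved as plain text in the format pasted above, counts the "ON-DISK ERROR" lines under each "Phase N:" heading, and collects the file names from the "Invalid direntry" lines. In the logs above, the two numbers in each direntry match the `c` and `r` values of a failing `<FD ...>` entry from Phase 3, so they are kept alongside the name.

```python
import re
from collections import Counter

PHASE_RE = re.compile(r"^Phase (\d+):")
DIRENTRY_RE = re.compile(r"^ON-DISK ERROR: Invalid direntry <(.+), (\d+), (\d+)>")


def summarize_voma_log(text):
    """Return (errors per phase, list of (filename, first number, second number))."""
    errors_per_phase = Counter()
    invalid_entries = []
    phase = None  # errors printed before any phase heading are counted under None
    for raw in text.splitlines():
        line = raw.strip()
        heading = PHASE_RE.match(line)
        if heading:
            phase = int(heading.group(1))
            continue
        if line.startswith("ON-DISK ERROR:"):
            errors_per_phase[phase] += 1
            entry = DIRENTRY_RE.match(line)
            if entry:
                invalid_entries.append(
                    (entry.group(1), int(entry.group(2)), int(entry.group(3)))
                )
    return errors_per_phase, invalid_entries


if __name__ == "__main__":
    import sys

    # Pass the path of a saved voma log on the command line.
    for log_path in sys.argv[1:]:
        with open(log_path, errors="replace") as handle:
            counts, entries = summarize_voma_log(handle.read())
        for phase_number, count in sorted(counts.items(), key=lambda kv: (kv[0] is None, kv[0] or 0)):
            print(f"phase {phase_number}: {count} errors")
        for name, first, second in entries:
            print(f"invalid direntry: {name} ({first}, {second})")
```

The list of invalid directory entries gives the exact file names to look for in the output of a recovery tool.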
