VMware Cloud Community
littlefixit
Contributor

ESXi 5.5 issue with powering on a VM: "An error was received from the ESX host while powering on VM" "22 (Invalid argument)" "Module DiskEarly power on failed."

Please note, I have already read and gleaned as much information as possible from the similarly named post found here: ESXi 5.5 issue with powering on a VM: "An error was received from the ESX host while powering on VM...

Upon attempting to start up my VM, I received the following error message:

[Screenshot: the power-on error dialog (Screen Shot 2016-01-14 at 9.56.27 AM.png)]

The contents of the error report are as follows:

Task Details:
Name: PowerOnVM_Task
Status: 22 (Invalid argument)
Description: Power On virtual machine
Start Time: 1/14/2016 9:56:04 AM
Completed Time: 1/14/2016 9:56:05 AM
State: Error
Error Stack:
Failed to start the virtual machine.
Module DiskEarly power on failed.
Cannot open the disk '/vmfs/volumes/5385d8c0-46872e4d-006f-bc5ff4756e42/nas02.littlefixit.com/nas02.littlefixit.com_13.vmdk' or one of the snapshot disks it depends on.
22 (Invalid argument)
Additional Task Details:
Host Build: 1623387
Error Type: GenericVmConfigFault
Task Id: Task
Cancelable: True
Cancelled: False
Description Id: VirtualMachine.powerOn
Event Chain Id: 157763039
Progress: 100

From an ESXi CLI, I executed the following command: esxcli storage vmfs extent list

~ # esxcli storage vmfs extent list
Volume Name  VMFS UUID                            Extent Number  Device Name                                                                Partition
-----------  -----------------------------------  -------------  --------------------------------------------------------------------------  ---------
areca_sda0  5385d8c0-46872e4d-006f-bc5ff4756e42              0  eui.001b4d23013d4800                                                                1
areca_sda2  5385d8ed-f5a5b613-4cde-bc5ff4756e42              0  eui.001b4d23013d4802                                                                1
datastore2  5385d1ab-5827eaf4-5e9d-bc5ff4756e42              0  t10.ATA_____WDC_WD1600JS2D55MHB1__________________________WD2DWCANM4579281          1
datastore1  5384634c-0d7183cb-5390-bc5ff4756e42              0  t10.ATA_____SC2_mSATA_SSD___________________________E1E0073B122900001171            3
areca_sda3  5385d901-d1456c44-fadc-bc5ff4756e42              0  eui.001b4d23013d4803                                                                1
areca_sda1  5385d8d4-38a33d99-6fb3-bc5ff4756e42              0  eui.001b4d23013d4801                                                                1
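
For reference, the datastore UUID in the failing VMDK path (5385d8c0-46872e4d-006f-bc5ff4756e42) matches the areca_sda0 extent above; a quick grep confirms the mapping if you'd rather not eyeball the table (typed from memory, so verify on your own host):

~ # esxcli storage vmfs extent list | grep 5385d8c0
areca_sda0  5385d8c0-46872e4d-006f-bc5ff4756e42              0  eui.001b4d23013d4800                                                                1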

Using this information (above) to identify the disk I needed to run VOMA on, I executed the following command: voma -m vmfs -f check -d /vmfs/devices/disks/eui.001b4d23013d4800:1

~ # voma -m vmfs -f check -d /vmfs/devices/disks/eui.001b4d23013d4800:1
Checking if device is actively used by other hosts
Running VMFS Checker version 1.0 in check mode
Initializing LVM metadata, Basic Checks will be done
Phase 1: Checking VMFS header and resource files
  Detected VMFS file system (labeled:'areca_sda0') with UUID:5385d8c0-46872e4d-006f-bc5ff4756e42, Version 5:60
Phase 2: Checking VMFS heartbeat region
ON-DISK ERROR: Invalid HB address <433795526656>
Phase 3: Checking all file descriptors.
  Found stale lock [type 10c00001 offset 95934464 v 307, hb offset 3829760
        gen 831, mode 1, owner 568a5d55-050b9c06-c1c5-bc5ff4756e42 mtime 525
        num 0 gblnum 0 gblgen 0 gblbrk 0]
Phase 4: Checking pathname and connectivity.
Phase 5: Checking resource reference counts.
ON-DISK ERROR: FB inconsistency found: (3229,0) allocated in bitmap, but never used
Total Errors Found:          2

For kicks and giggles, I ran this against a second disk just to see what the output of VOMA would be: voma -m vmfs -f check -d /vmfs/devices/disks/eui.001b4d23013d4801:1

~ # voma -m vmfs -f check -d /vmfs/devices/disks/eui.001b4d23013d4801:1
Checking if device is actively used by other hosts
Running VMFS Checker version 1.0 in check mode
Initializing LVM metadata, Basic Checks will be done
Phase 1: Checking VMFS header and resource files
  Detected VMFS file system (labeled:'areca_sda1') with UUID:5385d8d4-38a33d99-6fb3-bc5ff4756e42, Version 5:60
Phase 2: Checking VMFS heartbeat region
Phase 3: Checking all file descriptors.
Phase 4: Checking pathname and connectivity.
Phase 5: Checking resource reference counts.
Total Errors Found:          0

Some Context:

I am running an Areca RAID controller with 4x Western Digital 2TB SE SATA drives. Yes, I'm aware they're not ideal compared to SAS, but for a home lab this is more than adequate. In addition to these drives, I have a 160GB SATA drive and a 32GB SSD available in ESXi.

I allocated all four of the 2TB drives, 20GB of the 32GB SSD, and ~150GB of the 160GB hard drive to the nas02.littlefixit.com virtual machine.

Inside the virtual machine, I am running NAS4Free and have provisioned a RAIDZ2 pool, using the 20GB SSD device for cache, the 150GB device for logs, and the four 2TB devices (8TB raw) for data. This left me with ~3.75TB of usable space when all was said and done, for various accumulations of media, backups, etc.
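
For clarity, the layout is roughly equivalent to the following zpool commands. NAS4Free built the pool through its GUI, so the pool name 'tank' and the daN device names below are placeholders rather than my actual ones:

zpool create tank raidz2 da1 da2 da3 da4   # 4x 2TB virtual disks for data
zpool add tank log da5                     # the ~150GB virtual disk for the ZIL/log
zpool add tank cache da6                   # the 20GB SSD-backed virtual disk for L2ARC
zpool status tank                          # confirm the vdev layout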

[Screenshot: the VM's disk configuration, including Hard disk 4 (Screen Shot 2016-01-15 at 11.21.11 AM.png)]

My Question:

Since there isn't any official documentation or guide (anywhere) that I've been able to find on how to fix this, automatically or manually, and every knowledgebase article I've encountered says to open a support ticket with VMware (for the modest fee of only $300 per support incident, though you can save if you buy 3- or 5-packs... /facepalm), my options are extremely limited.

1.) Safest, most expensive: cough up the $300 and pay someone from VMware to fix it.

This is my least desirable option, for several reasons, but the trump card is that I have a year's worth of data here that my wife would really rather I not gamble with.


2.) The blunt-force-trauma approach: remove the disk [Hard disk 4, see image above] from the virtual machine and let the software RAIDZ2 inside NAS4Free attempt to deal with it.

Honestly, I only suspect this would work, and given the risk I've been reluctant to simply remove it from the VM configuration and try. My suspicion is that this would allow the VM to boot, that I'd be alerted to the missing disk once NAS4Free started up, and that I could then proceed by shutting down the guest OS, removing the nas02.littlefixit.com folder containing 'nas02.littlefixit.com_13.vmdk' on areca_sda0, re-adding it once the removal was complete, and powering the VM back on to let the RAIDZ2 have a crack at rebuilding everything for me. I'm only reluctant because in *theory* it *should* work, but I've never tried it.
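
For anyone weighing in, my mental model of the in-guest side of option 2 looks roughly like this; the pool and device names are illustrative only, and I have NOT actually run any of it:

zpool status -v tank     # pool should come up DEGRADED with one member UNAVAIL
zpool offline tank da4   # take the affected member offline if ZFS hasn't already
# ...shut the guest down, recreate the virtual disk on areca_sda0, power back on...
zpool replace tank da4   # resilver onto the freshly re-added virtual disk
zpool status -v tank     # watch the resilver until the pool reports ONLINE again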


3.) The too-smart-for-my-own-good approach: find the corrupted heartbeat, and attempt to fix it using one of the other disks as a template.

There is an article I stumbled on where someone was able to locate WHERE the corrupted heartbeat data lives in the VMFS metadata: virtualpatel.blogspot.com: How to verify VMFS Heartbeat Region Corruption. Theoretically, it might be possible to reverse that process and, instead of corrupting the data, overwrite the corresponding location with valid data from another disk. In fact, I suspect this is what VMware technicians are doing behind the scenes when they solve the problem for us, and what they've taught VOMA to do in v6.0 (which doesn't help me on v5.5).

Of course, I realize they're probably fixing data as well, but in my case I'm not terribly worried about that; I expect the software RAIDZ2 to resolve any inconsistencies on that front, if there's actually a problem at all. Looking over my VOMA output above, I'm not seeing any indication that data is actually corrupt, only that, due to the power outage I experienced, the heartbeat and some (minor) piece of metadata never got written. Since the machine was basically idling at the time and nothing was being written to the drives (no file transfers in progress, etc.), this reinforces my belief that just fixing the heartbeat would, in MY case, solve the issue. I am in no way advocating this solution for everyone out there, but here it seems plausible, if only because I have a software RAID in play.
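
To be explicit about what option 3 would involve on my end, the first step would be strictly read-only inspection, with a full image-level copy of the device before anything is ever written back. Something like the following, though the exact offsets (and whether hexdump is on a 5.5 build) would need verifying first:

~ # dd if=/vmfs/devices/disks/eui.001b4d23013d4800:1 of=/tmp/sda0_head.bin bs=1M count=8
~ # dd if=/vmfs/devices/disks/eui.001b4d23013d4801:1 of=/tmp/sda1_head.bin bs=1M count=8
~ # hexdump -C /tmp/sda0_head.bin | less

The first copy grabs the start of the damaged areca_sda0 volume (the heartbeat region sits a few MB in, going by the 'hb offset 3829760' in the VOMA output), the second grabs the same range from the healthy areca_sda1 volume for comparison, and hexdump is for walking the offsets described in the blog post. On ESXi 6.0+, VOMA's fix mode can apparently make this kind of repair itself, which is exactly the piece missing on 5.5.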

Any feedback on which direction I should take would be appreciated.
