VMware Cloud Community
AlexxBVL
Contributor

Fix corrupted GPT table

Hello

Some time ago I noticed a very large number of messages in vmkernel.log similar to:
Partition: 648: Read from primary gpt table failed on "naa.600a...."


Almost all datastore devices (completely different datastores and LUNs) appear in the log.

The output of the "partedUtil getptbl" command is:

Error: The primary GPT table is corrupt, but the backup appears OK, so that will be used. Fix primary table ? diskPath (/dev/disks/naa.600a...) diskSize (23622320128) AlternateLBA (1) LastUsableLBA (23622320094)


I have tried the "partedUtil fixGpt" command, and it fixes the GPT.

However, I have questions:
1. How safe is it to use in a production environment?
2. What are the unpredictable consequences of this command?
3. What can happen if you ignore these messages?

4. How can I see what exactly is damaged in the Primary GPT?

The output of the "partedUtil getptbl" command is the same before and after the fix:

gpt

1229833 255 63 19757268992
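(For reference: the four numbers printed by getptbl are the reported geometry, i.e. cylinders, heads and sectors per track, followed by the total sector count. With 512-byte sectors the total can be turned into a size, as in this small sketch:)

```shell
# Decode the "partedUtil getptbl" geometry line:
# cylinders, heads, sectors per track, total sectors (512-byte sectors).
awk 'BEGIN {
    total_sectors = 19757268992
    printf "datastore device size: %.1f TiB\n", total_sectors * 512 / (1024*1024*1024*1024)
}'
# prints: datastore device size: 9.2 TiB
```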

P.S. ESXi 6.5 U2; the datastores are connected to the hosts via FC.

Accepted Solutions
continuum
Immortal

1. How safe is it to use in a production environment?

If you receive the error message "The primary GPT table is corrupt, but the backup appears OK, so that will be used."
as opposed to the message "The primary GPT table is corrupt/missing",
then this fix is the best thing you can do. Even better, create a backup first by dumping the first MB of the volume to another datastore.
This will allow you to revert the fix in the improbable case that something goes wrong.

2. What are the unpredictable consequences of this command?

In some very rare cases the size of the datastore is reported incorrectly afterwards. If you hit such a case, you would not be able to mount the datastore again after a reboot.

In that case you would use the partedUtil commands that show the maximum usable size and adjust the partition size accordingly.

I would not recommend running the command while the datastore is highly active (during backups, for example), but other than that I am not aware of further unpredictable consequences.

3. What can happen if you ignore these messages?
In the worst case the backup GPT table gets lost too; then you would have to recreate the partition from scratch, which is far less desirable but still manageable.
If both tables are bad and you reboot, you will not be able to mount the datastore without recreating the partition table first.

4. How can I see what exactly is damaged in the Primary GPT?
You can run

hexdump -C /dev/disks/device | less

but this will not be very helpful unless you eat hexdumps for supper.
A GPT table uses a strict syntax, and if even a few bits are wrong, partedUtil will not display anything at all.
If you ask because you are surprised that a modern OS would corrupt the partition table at all, consider that ESXi keeps information like the partition table in RAM most of the time.
So unpredictable events like power failures have more severe consequences than you may be used to with an OS like Windows, for example.
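A more targeted check than a raw hexdump is to diff the primary GPT header (LBA 1) against the backup header in the disk's last sector; the offsets cmp reports tell you which bytes went bad. Here is a sketch rehearsed on a scratch file (disk.img stands in for the real /dev/disks/ device; note that on a real disk a few fields, the LBA pointers and CRCs, legitimately differ between the two headers):

```shell
#!/bin/sh
# Rehearsal on a scratch file: plant a GPT signature in the "primary" and
# "backup" header sectors, damage the primary, then diff the two headers.
set -e
dd if=/dev/zero of=disk.img bs=512 count=16 2>/dev/null
printf 'EFI PART' | dd of=disk.img bs=512 seek=1  conv=notrunc 2>/dev/null  # primary header (LBA 1)
printf 'EFI PART' | dd of=disk.img bs=512 seek=15 conv=notrunc 2>/dev/null  # backup header (last LBA)
printf 'X' | dd of=disk.img bs=1 seek=516 conv=notrunc 2>/dev/null          # damage one primary byte

# Extract both header sectors; on a real device compute the last LBA from its size.
dd if=disk.img of=primary.bin bs=512 skip=1  count=1 2>/dev/null
dd if=disk.img of=backup.bin  bs=512 skip=15 count=1 2>/dev/null
# cmp -l prints "<offset> <octal byte in primary> <octal byte in backup>" per mismatch
cmp -l primary.bin backup.bin || echo "headers differ"
```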

Summary:

I regard replacing the bad primary table with the healthy backup table as one of the few well-documented and safe options you have when dealing with VMFS problems.

Ulli


________________________________________________
Do you need support with a VMFS recovery problem? Send a message via Skype: "sanbarrow".
I do not support Workstation 16 at this time ...

8 Replies
AlexxBVL
Contributor

Thank you for the detailed answer.

> Even better, create a backup first by dumping the first MB of the volume to another datastore.
> This will allow you to revert the fix in the improbable case that something goes wrong.

Can you give an example of how I can dump, and then load back, the first megabyte of the partition?

(Maybe this? For the dump: dd if=/vmfs/devices/disks/naa.ID of=/vmfs/volumes/otherDatastore/dump.bin bs=1M count=1)

> If both tables are bad and you reboot, you will not be able to mount the datastore without recreating the partition table first.

Do I understand correctly that after a host reboot, the problem with the datastore will only affect that host? Will other hosts continue to work with the datastore without problems until they are rebooted?

continuum
Immortal

You got it already!

> (Maybe this? For the dump: dd if=/vmfs/devices/disks/naa.ID of=/vmfs/volumes/otherDatastore/dump.bin bs=1M count=1)

That creates the backup. To revert, use:
dd of=/vmfs/devices/disks/naa.ID if=/vmfs/volumes/otherDatastore/dump.bin bs=1M count=1 conv=notrunc

> Do I understand correctly that when the host is rebooted, the problem with the datastore will only be on this host?
> Other hosts will continue to work with the datastore without any problems until they are rebooted?

So you have a VMFS volume on shared storage in a cluster?
This can sometimes have strange effects in a cluster: the situation may look infectious and appear to be deteriorating across the cluster.
Keep cool: try to isolate the datastore to a single host if possible, then do the fix there and reboot that single host. If that is not possible, do the fix and reboot each host as soon as production allows.
But I have not seen such issues in quite a while; I saw them more frequently with ESXi 5.x.
Basically, an ESXi host should be able to continue operating if the partition table gets lost after the host has finished booting.
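For anyone nervous about writing back with conv=notrunc, the backup/restore round trip can be rehearsed on a scratch file first (disk.img below is a hypothetical stand-in for /vmfs/devices/disks/naa.ID; this sketches the logic only, it is not the real fix):

```shell
#!/bin/sh
set -e
# Scratch "disk" standing in for /vmfs/devices/disks/naa.ID
dd if=/dev/zero of=disk.img bs=1M count=4 2>/dev/null
printf 'EFI PART' | dd of=disk.img bs=512 seek=1 conv=notrunc 2>/dev/null

# 1) Back up the first MiB (protective MBR + primary GPT live there)
dd if=disk.img of=dump.bin bs=1M count=1 2>/dev/null

# 2) Simulate corruption of the primary GPT header at LBA 1
dd if=/dev/urandom of=disk.img bs=512 seek=1 count=1 conv=notrunc 2>/dev/null

# 3) Restore; conv=notrunc overwrites in place without truncating the target
dd if=dump.bin of=disk.img bs=1M count=1 conv=notrunc 2>/dev/null

cmp -n 1048576 disk.img dump.bin && echo "first MiB restored"
```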



AlexxBVL
Contributor

Thank you again for helping me understand how it works :)

b1stern
Contributor

I'm having similar problems with two of my storage volumes on a Storwize V7000 storage system. The other volumes are fine. I attempted to repair the volumes, and I'm not seeing errors on the ESXi hosts. But if I attempt to add the storage volume, vSphere says that it will create a NEW datastore and will wipe out the data on the volume.

Is there any way for me to save the data on the Volume?  I have multiple VMs stored there.

continuum
Immortal

Don't create "me too" posts for problems like this.
Create a new post instead and provide as many details as possible.

Ulli



Ffuller
Contributor

Have you been able to correct this error? I have the same issue on a V7000: many VMs on the datastore and a corrupt GPT table. I need to recover them ASAP.

continuum
Immortal

It sounds like you have a more serious problem. If many VMDKs don't match their nominal size, check whether the VMFS itself is still healthy.

If you only fix corrupt GPT tables, using either the original or the backup GPT table, you can easily make matters worse.
