More than a year ago I was involved in a similar issue with ESX 2.5.x.
For unknown reasons, one of our customers lost the partition tables of his VMFS datastores.
We checked our storage array, but everything was working as it should.
It looked like a rolling disaster: one node after another was affected.
This behavior was caused by our rescan activities.
VMware support was finally able to rebuild all partition tables.
But the root cause, the process responsible for changing the partition table, was never found.
Are you confident in your presentation & zoning? Is it possible these LUNs are open to other systems? If so, a Windows system will grab and format the LUN, as it will come up first in the list of available devices during install.
There's a VMworld presentation on recovering the partition table; the VMFS3 recovery begins on p. 29: http://mylearn.vmware.com/courseware/12027/PS_TA48_288131_166-1_FIN_v2.pdf
One of my colleagues also presented on the topic at the PDX VMUG.
It's a good idea(tm) to get VMware support involved.
It's not the first time for us either. Just like now with ESX 3.5, we had these issues with ESX 2.5.3 and ESX 3.0.1.
If there was an award for this, we'd have won it!
But the only thing I can add right now is that I did indeed perform some rescans on the ESX servers.
One of the blade servers seems to be defective: when I perform a rescan on hba2 everything works fine, but when I do it on hba1 it seems to freeze everything.
So my wild guess is that this action affects all machines and ends with a lost partition.
The problem with this kind of issue is that rescanning the disks makes the situation even worse.
That's because an ESX server usually doesn't reread the partition table during normal operations.
So if an unknown process clears the partition table, the change is simply ignored by the host.
The datastores (or RDM devices) can still be accessed as usual; the partition information appears to be cached in memory.
But when you initiate a rescan operation, the cached content is replaced by the data on disk.
That means the datastore is gone for the host where you perform the rescan.
This can easily end in a rolling disaster when you then rescan the next host to verify whether there's a problem with your first host.
Of course HA/DRS already starts moving VMs to different nodes of your cluster, while your blood pressure climbs to new records...
So, as usual in a critical situation, stay calm (really easy, right???).
As long as you've only lost the partition table, you're fine.
Simply recreate it with fdisk, verify that a reread finds the datastore, and finally head to the pub for your "tough guy's" beer.
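For what it's worth, here is roughly how that looks from the ESX 3.x service console. Treat this as a sketch only: /dev/sdb and vmhba1 are placeholders for your actual device and adapter, the fdisk answers depend on your original layout, and you should have VMware support on the line before writing anything.

esxcfg-vmhbadevs             # map vmhba paths to /dev/sd* device names
fdisk -l /dev/sdb            # confirm the partition table really is empty
fdisk /dev/sdb               # n -> new primary partition 1, spanning the disk
                             # t -> partition ID fb (VMware VMFS)
                             # x, then b -> move the start of partition 1 to sector 128
                             # w -> write the table and quit
esxcfg-rescan vmhba1         # rescan so this host rereads the new table
vmkfstools -V                # refresh the host's view of its VMFS volumes

If only the partition table was cleared and the VMFS metadata is intact, the datastore should reappear without any format.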
Jae Ellers, also thanks for your answer. I'm almost sure this RDM is presented only to this VM, but I will check again to be certain.
OK, it is presented only to the 8 ESX servers in the storage group. So that's all.
Like I said before, is there a possibility that this has something to do with it?
One of the blade servers seems to be defective: when I perform a rescan on hba2 it works fine, but when I do it on hba1 it seems to freeze. The process hangs or something like that.
So my wild guess is that this action affects all machines and ends with a lost partition.
Hmm, I'm afraid I can't follow you, Ghost. Could you simplify it?
In the ESX 2.5.x days, when the partition table was cleared, the server could still access the datastore placed on that disk.
Maybe this has changed with 3.5, but I don't think so.
In our scenario, the customer used 6-8 ESX servers as a cluster.
The problem started on one server: it reported all of its shared LUNs with valid datastores as blank.
All the other hosts using these shared LUNs still showed the datastores, including the consumed space.
We initiated a rescan on one of the remaining hosts, and as a result this host also lost the shared datastores.
Because the partition table had been cleared, the rescan forced the host to reread it.
Therefore this host also saw blank LUNs, while the remaining hosts still saw the datastore.
From a technical point of view, these two hosts displayed the correct information, because the partition tables had been cleared by an unidentified process.
Each of the remaining hosts would have behaved identically if a rescan had been performed on it.
Recreating the cleared partition table fixes the error, but not the root cause.
But after that, all hosts can perform a rescan without losing access to their shared datastores.
Hope this answers your questions; I'm not a native English speaker.
Hi all, we've just experienced this in our new c7000 production environment running 4 x BL680c blades with Emulex HBAs and boot from SAN, on ESX 3.5 U1. A couple of days ago, 5 Windows 2003 VMs lost their RDM partitions, with the volume showing up as unallocated in Windows and the RDM having no partition table on the volume. Nothing in the logs apart from quite a few SCSI reservation conflicts, which look like they may be related to the Insight agents on the servers. It may have happened while I was running a rescan, but there's no indication of the cause and I cannot replicate it. Did you find a fix? I've logged a call with VMware and am awaiting a reply, but we have had to stop until the issue is resolved.
It is still an open question why this happened. But the only solution is to rebuild the partition table with the fdisk command.
For that you need to know the device name, such as sda, sdb, and so on.
You also need to know the start sector (mostly 128) and the partition ID (7 for NTFS).
But be careful with this command: if you do it wrong, all data can be lost.
Hope this helps.
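To make that concrete, an interactive fdisk run for a lost NTFS RDM partition might look like the sketch below. /dev/sdc is a placeholder for the actual device, and the start sector must match what the partition originally used, otherwise Windows will still see garbage.

fdisk /dev/sdc
  n   -> new primary partition, number 1, spanning the disk
  t   -> partition ID 7 (HPFS/NTFS)
  x   -> expert mode
  b   -> set the start of partition 1 to sector 128 (the value mentioned above; confirm it for your LUN)
  w   -> write the table and exit

Only the partition entry in the MBR is rewritten here; the NTFS data itself is untouched, which is why the volume comes back intact when the values are right.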
Thanks, we already used that to recover the data. What storage back end do you use? I've noticed that I can reproduce the same error codes on the ESX server as when the partitions failed, by unmapping a LUN from the initiator group. It doesn't corrupt the RDMs when this happens, but we are starting to see things happening. The VMs also start their Virtual Disk Service and Logical Disk Manager when this occurs. Very strange.
UNMAP VOL_DELETE LUN 33
Jun 20 11:45:29 cisqhsesx004 vmkernel: 16:18:34:11.513 cpu4:2027)StorageMonitor: 196: vmhba1:3:1:0 status = 2/0 0x6 0x3f 0xe
Jun 20 11:45:29 cisqhsesx004 vmkernel: 16:18:34:11.514 cpu4:2027)StorageMonitor: 196: vmhba1:1:0:0 status = 2/0 0x6 0x3f 0xe
Jun 20 11:45:29 cisqhsesx004 vmkernel: 16:18:34:11.852 cpu4:2027)StorageMonitor: 196: vmhba1:3:1:0 status = 2/0 0x6 0x29 0x0
Jun 20 11:45:30 cisqhsesx004 vmkernel: 16:18:34:12.532 cpu4:2027)StorageMonitor: 196: vmhba1:3:2:0 status = 2/0 0x6 0x29 0x0
Jun 20 11:45:30 cisqhsesx004 vmkernel: 16:18:34:12.532 cpu4:2027)StorageMonitor: 196: vmhba1:3:3:0 status = 2/0 0x6 0x29 0x0
Jun 20 11:45:33 cisqhsesx004 vmkernel: 16:18:34:15.493 cpu7:2140)StorageMonitor: 196: vmhba1:3:4:0 status = 2/0 0x6 0x3f 0xe
Jun 20 11:45:33 cisqhsesx004 vmkernel: 16:18:34:15.493 cpu7:2140)StorageMonitor: 196: vmhba1:3:4:0 status = 2/0 0x6 0x29 0x0
Jun 20 11:46:08 cisqhsesx004 vmkernel: 16:18:34:51.156 cpu9:2146)StorageMonitor: 196: vmhba1:3:14:0 status = 2/0 0x6 0x29 0x0
Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.327 cpu9:1055)SCSI: 2020: Marking path vmhba1:0:33 as dead
Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.328 cpu9:1055)SCSI: 2020: Marking path vmhba1:2:33 as dead
Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.329 cpu9:1055)SCSI: 2020: Marking path vmhba2:1:33 as dead
Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.329 cpu9:1055)SCSI: 2020: Marking path vmhba2:2:33 as dead
I believe we had the same error as shown above (paths being marked dead), but when I check, those paths are alive. But again, I can't see what happened at that time.
We have an EMC CLARiiON CX500.
Were you doing anything on the storage array at the time these went down: presenting LUNs, unpresenting LUNs, rescanning the storage? I managed to re-create it once more today, but not since. The difference was that we had another ESX cluster sharing the same storage with the current cluster; we created a VM, added some storage, shut down the newly powered-on cluster, unmapped an unused LUN, and instantly 5 VMs lost their MBR. VMware are saying this should not cause any issue, as ESX controls access to the LUNs, not the VirtualCenter clustering. Determined to fix this.
I was rescanning the storage at that time, and the HBA was probably defective! But even after replacing the HBA, this still happened after rescanning. After that I replaced the mainboard of the server and fitted another new HBA.
Since then, this has never happened again. So it seems the server was defective, although VMware said this could not be the problem.
Thanks teovmy. Did this only start happening after moving to 3.5 on your blades, and are you using Virtual Connect? Out of interest, how many hosts are in the cluster, and how many VMs? We just can't get the thing to replicate now and are resorting to hitting it with a stick.