More than a year ago I was involved in a similar issue with ESX 2.5.x.
For unknown reasons, one of our customers lost the partition tables of his VMFS datastores.
We checked our storage array, but everything was working as it should.
It looked like a rolling disaster: one node after another was affected.
This behavior was caused by our rescan activities.
VMware support was finally able to rebuild all partition tables.
But the root cause, the process responsible for changing the partition table, was never found.
Are you confident in your presentation & zoning? Is it possible these LUNs are open to other systems? If so, a Windows system will grab and format the LUN, as it will come up first in the list of available devices during install.
There's a VMworld presentation on recovering the partition table; the VMFS3 recovery begins on p. 29: http://mylearn.vmware.com/courseware/12027/PS_TA48_288131_166-1_FIN_v2.pdf
One of my colleagues also presented on the topic at the PDX VMUG.
It's a good idea(tm) to get VMware support involved.
It's not the first time for us either. Just like now with ESX 3.5, we had these issues with ESX 2.5.3 and ESX 3.0.1.
If there was an award for this, we'd have won it!
But the only thing I can add right now is that I did indeed perform some rescans on the ESX servers.
One of the blade servers seems to be defective: when I perform a rescan on hba2 everything works fine, but when I do it on hba1 it seems to freeze everything.
So my wild guess is that this action affects all machines and ends with a lost partition.
The problem with this kind of issue is that rescanning the disks makes the situation even worse.
That's because an ESX server usually doesn't reread the partition table during normal operations.
So if an unknown process clears the partition table, the change is simply ignored by the host.
The datastores (or RDM devices) can still be accessed as usual; the partition information appears to be cached in memory.
But when you initiate a rescan operation, the cached content is replaced by the data on disk.
That means the datastore is gone for the host where you perform the rescan.
This can easily end in a rolling disaster when you then rescan the next host to verify whether there's a problem with your first host.
Of course HA/DRS already starts moving VMs to different nodes of your cluster, while your blood pressure climbs to new records...
So, as usual in a critical situation, stay calm (really easy, right???).
As long as you've only lost the partition table, you're fine.
Simply recreate it with fdisk, verify that a reread finds the datastore, and finally head to the pub for your "tough guy's" beer.
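For what it's worth, here is roughly how that looks from the ESX 3.x service console. Treat this as a sketch only: /dev/sdb and vmhba1 are placeholders for your actual device and adapter, the fdisk answers depend on your original layout, and you should have VMware support on the line before writing anything.

esxcfg-vmhbadevs             # map vmhba paths to /dev/sd* device names
fdisk -l /dev/sdb            # confirm the partition table really is empty
fdisk /dev/sdb               # n -> new primary partition 1, spanning the disk
                             # t -> partition ID fb (VMware VMFS)
                             # x, then b -> move the start of partition 1 to sector 128
                             # w -> write the table and quit
esxcfg-rescan vmhba1         # rescan so this host rereads the new table
vmkfstools -V                # refresh the host's view of its VMFS volumes

If only the partition table was cleared and the VMFS metadata is intact, the datastore should reappear without any format.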
Jae Ellers, also thanks for your answer. I'm almost sure this RDM is presented only to this VM, but I will check again to be certain.
OK, it is presented only to the 8 ESX servers in the storage group. So that's all.
Like I said before, is there a possibility that this has something to do with it?
One of the blade servers seems to be defective: when I perform a rescan on hba2 it works fine, but when I do it on hba1 it seems to freeze. The process hangs or something like that.
So my wild guess is that this action affects all machines and ends with a lost partition.
Hmm, I'm afraid I can't follow you, Ghost. Could you simplify it?
In the ESX 2.5.x days, when the partition table was cleared, the server could still access the datastore placed on that disk.
Maybe this has changed with 3.5, but I don't think so.
In our scenario, the customer used 6-8 ESX servers as a cluster.
The problem started on one server: it reported all of its shared LUNs with valid datastores as blank.
All the other hosts using these shared LUNs still showed the datastores, including the consumed space.
We initiated a rescan on one of the remaining hosts, and as a result this host also lost the shared datastores.
Because the partition table had been cleared, the rescan forced the host to reread it.
Therefore this host also saw blank LUNs, while the remaining hosts still saw the datastore.
From a technical point of view, these two hosts displayed the correct information, because the partition tables had been cleared by an unidentified process.
Each of the remaining hosts would have behaved identically if a rescan had been performed on it.
Recreating the cleared partition table fixes the error, but not the root cause.
But after that, all hosts can perform a rescan without losing access to their shared datastores.
Hope this answers your questions; I'm not a native English speaker.
Hi all, we've just experienced this in our new c7000 production environment running 4 x BL680c blades with Emulex HBAs and boot from SAN, on ESX 3.5 U1. A couple of days ago, 5 Windows 2003 VMs lost their RDM partitions, with the volume showing up as unallocated in Windows and the RDM having no partition table on the volume. Nothing in the logs apart from quite a few SCSI reservation conflicts, which look like they may be related to the Insight agents on the servers. It may have happened while I was running a rescan, but there's no indication of the cause and I cannot replicate it. Did you find a fix? I've logged a call with VMware and am awaiting a reply, but we have had to stop until the issue is resolved.
It is still an open question why this happened. But the only solution is to rebuild the partition table with the fdisk command.
For that you need to know the device name, such as sda, sdb, and so on.
You also need to know the start sector (mostly 128) and the partition ID (7 for NTFS).
But be careful with this command: if you do it wrong, all data can be lost.
Hope this helps.
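To make that concrete, an interactive fdisk run for a lost NTFS RDM partition might look like the sketch below. /dev/sdc is a placeholder for the actual device, and the start sector must match what the partition originally used, otherwise Windows will still see garbage.

fdisk /dev/sdc
  n   -> new primary partition, number 1, spanning the disk
  t   -> partition ID 7 (HPFS/NTFS)
  x   -> expert mode
  b   -> set the start of partition 1 to sector 128 (the value mentioned above; confirm it for your LUN)
  w   -> write the table and exit

Only the partition entry in the MBR is rewritten here; the NTFS data itself is untouched, which is why the volume comes back intact when the values are right.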
Thanks, we already used that to recover the data. What storage back end do you use? I've noticed that I can reproduce the same error codes on the ESX server as when the partitions failed, by unmapping a LUN from the initiator group. It doesn't corrupt the RDMs when this happens, but we are starting to see things happening. The VMs also start their Virtual Disk Service and Logical Disk Manager when this occurs. Very strange.
UNMAP VOL_DELETE LUN 33
Jun 20 11:45:29 cisqhsesx004 vmkernel: 16:18:34:11.513 cpu4:2027)StorageMonitor: 196: vmhba1:3:1:0 status = 2/0 0x6 0x3f 0xe
Jun 20 11:45:29 cisqhsesx004 vmkernel: 16:18:34:11.514 cpu4:2027)StorageMonitor: 196: vmhba1:1:0:0 status = 2/0 0x6 0x3f 0xe
Jun 20 11:45:29 cisqhsesx004 vmkernel: 16:18:34:11.852 cpu4:2027)StorageMonitor: 196: vmhba1:3:1:0 status = 2/0 0x6 0x29 0x0
Jun 20 11:45:30 cisqhsesx004 vmkernel: 16:18:34:12.532 cpu4:2027)StorageMonitor: 196: vmhba1:3:2:0 status = 2/0 0x6 0x29 0x0
Jun 20 11:45:30 cisqhsesx004 vmkernel: 16:18:34:12.532 cpu4:2027)StorageMonitor: 196: vmhba1:3:3:0 status = 2/0 0x6 0x29 0x0
Jun 20 11:45:33 cisqhsesx004 vmkernel: 16:18:34:15.493 cpu7:2140)StorageMonitor: 196: vmhba1:3:4:0 status = 2/0 0x6 0x3f 0xe
Jun 20 11:45:33 cisqhsesx004 vmkernel: 16:18:34:15.493 cpu7:2140)StorageMonitor: 196: vmhba1:3:4:0 status = 2/0 0x6 0x29 0x0
Jun 20 11:46:08 cisqhsesx004 vmkernel: 16:18:34:51.156 cpu9:2146)StorageMonitor: 196: vmhba1:3:14:0 status = 2/0 0x6 0x29 0x0
Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.327 cpu9:1055)SCSI: 2020: Marking path vmhba1:0:33 as dead
Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.328 cpu9:1055)SCSI: 2020: Marking path vmhba1:2:33 as dead
Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.329 cpu9:1055)SCSI: 2020: Marking path vmhba2:1:33 as dead
Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.329 cpu9:1055)SCSI: 2020: Marking path vmhba2:2:33 as dead
I believe we had the same error as shown above (paths being marked dead), but when I check, those paths are alive. But again, I can't see what happened at that time.
We have an EMC CLARiiON CX500.
Were you doing anything on the storage array at the time these went down: presenting LUNs, unpresenting LUNs, rescanning the storage? I managed to re-create it once more today, but not since. The difference was that we had another ESX cluster sharing the same storage with the current cluster; we created a VM, added some storage, shut down the newly powered-on cluster, unmapped an unused LUN, and instantly 5 VMs lost their MBR. VMware are saying this should not cause any issue, as ESX controls access to the LUNs, not the VirtualCenter clustering. Determined to fix this.
I was rescanning the storage at that time, and the HBA was probably defective! But even after replacing the HBA, this still happened after rescanning. After that I replaced the mainboard of the server and fitted another new HBA.
Since then, this has never happened again. So it seems the server was defective, although VMware said this could not be the problem.
Thanks teovmy. Did this only start happening after moving to 3.5 on your blades, and are you using Virtual Connect? Out of interest, how many hosts are in the cluster, and how many VMs? We just can't get the thing to replicate now and are resorting to hitting it with a stick.