21 Replies. Latest reply on Dec 17, 2008 5:58 PM by Toots

    Lost Partition Table of RDM

    Teovmy Enthusiast

      We've run into a really strange problem.

      We have an HP BladeSystem c7000 with 8 BL460c servers running ESX 3.5 (I just upgraded from 3.0.2 to 3.5 with a fresh install). They have Emulex HBA cards in them.

      For the fourth time in two weeks, we have lost the partition tables of four RDMs for no apparent reason.

      Two RDMs are connected to our file server and two are connected to the Exchange server. We have more servers with RDMs connected than just these two, but the strange thing is that it only happens to these two Windows 2003 servers.

      Only these servers have access to the raw LUNs, no other servers. There is a lot more to tell, but luckily we can fix it quickly with fdisk. Still, it is really an issue!

      I called VMware and went through the log files of the ESX server with an engineer. There was really nothing to see there. The event log in the VM itself shows nothing either, only that it lost the disk information, because the disk is still presented but not readable.

        • 1. Re: Lost Partition Table of RDM
          kastlr Expert

          Hi,

           

          More than a year ago I was involved in a similar issue with ESX 2.5.x.

          For an unknown reason, one of our customers lost the partition tables of his VMFS datastores.

          We checked our storage array, but everything worked as it should.

          It looked like a rolling disaster: one node after another was affected.

          This behavior was caused by our rescan activities.

          VMware support was finally able to rebuild all partition tables.

          But the root cause, the process responsible for changing the partition table, was never found.

          • 2. Re: Lost Partition Table of RDM
            Jae Ellers Master

             

            Are you confident in your presentation & zoning? Is it possible these LUNs are open to other systems? If so, a Windows system will grab & format such a LUN, as it will come up first in the list of available devices for install.

            There's a VMworld presentation on recovering the partition table. P. 29 begins the VMFS3 recovery: http://mylearn.vmware.com/courseware/12027/PS_TA48_288131_166-1_FIN_v2.pdf

             

             

            We also had one of my colleagues present on the topic at the PDX VMUG:

             

             

            It's a good idea(tm) to get VMware support involved.

             

             

            -=-=-=-=-=-=-=-=-=-=-=-=-=-=-

            http://blog.mr-vm.com/

            http://www.vmprofessional.com/

            -=-=-=-=-=-=-=-=-=-=-=-=-=-=-

             

             

            • 3. Re: Lost Partition Table of RDM
              Teovmy Enthusiast

               

              Hi Ghost.

               

               

              This is not the first time for us either. Just like now on ESX 3.5, we had this issue with ESX 2.5.3 and ESX 3.0.1.

              If there were an award for this, we would have won it!

              The only thing I can add now is that I do indeed perform some rescans on the ESX servers.

              One of the blade servers seems to be defective: when I perform a rescan on hba2 it all works well, but when I do it on hba1 everything seems to freeze.

              So my wild guess is that this action affects all machines, ending in a lost partition.

               

               

              • 4. Re: Lost Partition Table of RDM
                kastlr Expert

                The trouble with this kind of problem is that rescanning the disks makes the situation even worse.

                This is because an ESX server usually doesn't reread the partition table during normal operations.

                So if an unknown process has cleared the partition table, the host usually ignores it.

                The datastores (or RDM devices) can be accessed as usual; the partition information seems to be cached in memory.

                But when you initiate a rescan operation, the cached content is updated with the data from the disk.

                This means the datastore is gone for the host where you performed the rescan.

                This can easily end in a rolling disaster when you perform a rescan on the next host to verify whether there's a problem with your first host.

                Of course HA/DRS is already starting to move VMs to different nodes of your cluster, and blood pressure is climbing to new records...

                So, as usual in a critical situation, stay calm (really easy, right???)

                As long as you're only losing the partition table, you're fine.

                Simply recreate it with fdisk, verify that a reread finds the datastore, and finally head to a pub and drink your "tough guy's" beer.

                • 5. Re: Lost Partition Table of RDM
                  Teovmy Enthusiast

                   

                  Jae Ellers, thanks also for your answer. I am almost certain that this RDM is only presented to this VM, but I will check again to be sure.

                  OK, it is only presented to the 8 ESX servers in the storage group. So that's all.

                  Like I said before: is it possible that this story has something to do with it?

                  -

                  One of the blade servers seems to be defective: when I perform a rescan on hba2 it all works well, but when I do it on hba1 it seems to freeze. The process hangs or something like that.

                  So my wild guess is that this action affects all machines, ending in a lost partition.

                   

                   

                  • 6. Re: Lost Partition Table of RDM
                    Teovmy Enthusiast

                    Hmm, I'm afraid I can't follow you, Ghost. Could you simplify it?

                    • 7. Re: Lost Partition Table of RDM
                      kastlr Expert

                      In the ESX 2.5.x days, when the partition table was cleared, the server could still access the datastore placed on that disk.

                      Maybe this has changed with 3.5, but I don't think so.

                      In our scenario, the customer used 6-8 ESX servers as a cluster.

                      The problem started on one server, which reported all of its shared LUNs with valid datastores as blank.

                      All other hosts using these shared LUNs still showed the datastores, including the consumed space.

                      We initiated a rescan on one of the remaining hosts, and as a result this host also lost the shared datastores.

                      Because the partition table had been cleared, the rescan forced the host to reread it.

                      Therefore this host now also saw blank LUNs, while the remaining hosts still saw the datastores.

                      From a technical point of view, these two hosts displayed the correct information, because the partition tables had been cleared by an unidentified process.

                      Each of the remaining hosts would have behaved identically if a rescan had been performed on it.

                      Recreating the cleared partition table fixes the error, but not the root cause.

                      But afterwards all hosts can perform a rescan without losing access to their shared datastores.
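
                      To make the sequence above concrete, here is a toy model in Python. It is only a sketch of the behaviour as described in this post (a host keeps using a cached partition table until a rescan forces a reread); it is not based on ESX internals, and the names `SharedLun`, `Host`, and `simulate` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SharedLun:
    """A shared LUN; partition_table is empty once it has been cleared on disk."""
    partition_table: tuple


@dataclass
class Host:
    """A host that works from a cached copy of the partition table (assumed model)."""
    name: str
    cached_table: tuple = ()

    def rescan(self, lun: SharedLun) -> None:
        # A rescan replaces the cached copy with whatever is on disk right now.
        self.cached_table = lun.partition_table

    def sees_datastore(self) -> bool:
        return len(self.cached_table) > 0


def simulate():
    lun = SharedLun(partition_table=("vmfs",))
    hosts = [Host(f"esx{i}") for i in range(1, 5)]
    for host in hosts:
        host.rescan(lun)  # initial discovery: every host caches a valid table

    lun.partition_table = ()  # an unidentified process clears the table on disk
    before_rescan = [h.sees_datastore() for h in hosts]

    hosts[0].rescan(lun)  # first host is rescanned and loses the datastore
    hosts[1].rescan(lun)  # rescanning a second host "to verify" spreads the problem
    after_rescans = [h.sees_datastore() for h in hosts]

    lun.partition_table = ("vmfs",)  # table recreated (e.g. with fdisk)
    for host in hosts:
        host.rescan(lun)  # rescans are safe again
    after_fix = [h.sees_datastore() for h in hosts]

    return before_rescan, after_rescans, after_fix
```

                      In this model, all four hosts still see the datastore right after the table is cleared, only the two rescanned hosts lose it, and every host sees it again once the table is recreated and rescanned.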

                       

                      I hope this answers your questions; I'm not a native English speaker.

                      • 8. Re: Lost Partition Table of RDM
                        LeeCarey Novice

                         

                        Hi all, we've just experienced this in our new c7000 production environment running 4 x BL680c blades with Emulex HBAs and boot from SAN, on ESX 3.5 U1. A couple of days ago, five Windows 2003 VMs lost their RDM partitions: the volume shows up as unallocated in Windows, and the RDM now has no partition table on the volume. There is nothing in the logs apart from quite a few SCSI reservation conflicts, which look like they may be related to the Insight agents on the server. It may have happened while I was running a rescan, but there is no indication of the cause and I cannot replicate it. Did you find a fix? I've logged a call with VMware and am awaiting a reply, but we have had to stop until the issue is resolved.

                        Cheers

                        L

                         

                         

                        • 9. Re: Lost Partition Table of RDM
                          Teovmy Enthusiast

                           

                          Hi LeeCarey

                           

                           

                          It is still unclear why this happened. The only solution is to rebuild the partition table with the fdisk command.

                          For that you need to know the device name, such as sda, sdb and so on.

                          You also need to know the start sector (usually 128) and the partition ID (7 for NTFS).

                          But be careful with this command, because if you get it wrong, all data can be lost.
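
                          Before running fdisk, it can help to look at what is actually in the first sector. The sketch below is an illustration only and works purely on bytes in memory. It relies on the standard MBR layout (four 16-byte entries at offset 446, signature 0x55 0xAA at offset 510). The helper names `read_partitions` and `make_entry` and the 2097152-sector LUN size are hypothetical, and the start sector 128 and type 7 are the values mentioned above.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446   # the partition table starts here in a classic MBR
ENTRY_SIZE = 16      # four entries of 16 bytes each
NTFS_TYPE = 0x07     # partition ID 7, as used for NTFS


def read_partitions(sector: bytes):
    """Return the non-empty partition entries found in a 512-byte MBR sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly one 512-byte sector")
    entries = []
    for slot in range(4):
        offset = TABLE_OFFSET + slot * ENTRY_SIZE
        ptype = sector[offset + 4]  # byte 4 of an entry is the partition type
        # Bytes 8-15 hold the start LBA and sector count (little-endian uint32).
        start_lba, num_sectors = struct.unpack_from("<II", sector, offset + 8)
        if ptype != 0:
            entries.append({"slot": slot + 1, "type": ptype,
                            "start": start_lba, "sectors": num_sectors})
    return entries


def make_entry(start_lba: int, num_sectors: int, ptype: int = NTFS_TYPE) -> bytes:
    """Build one 16-byte entry; CHS fields are filled with a generic placeholder."""
    chs = b"\xfe\xff\xff"
    return struct.pack("<B3sB3sII", 0x00, chs, ptype, chs, start_lba, num_sectors)


# A cleared table: nothing but zeros, so no partitions are reported.
blank = bytes(SECTOR_SIZE)

# What a rebuilt table could look like for a hypothetical 2097152-sector LUN.
rebuilt = bytearray(blank)
rebuilt[TABLE_OFFSET:TABLE_OFFSET + ENTRY_SIZE] = make_entry(128, 2097152 - 128)
rebuilt[510:512] = b"\x55\xaa"  # MBR boot signature
rebuilt = bytes(rebuilt)
```

                          To check a real device, you would read its first 512 bytes (opened read-only) and pass them to `read_partitions`. Nothing in this sketch writes to disk; the actual repair is still done with fdisk.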

                           

                           

                           

                           

                           

                          Hope this helps.

                           

                           

                           

                           

                           

                          • 10. Re: Lost Partition Table of RDM
                            LeeCarey Novice

                             

                            Thanks, we already used that to recover the data. What storage back end do you use? I've noticed that I can reproduce the error codes we saw on the ESX server when the partitions failed by unmapping a LUN from the initiator group. It doesn't corrupt the RDMs when this happens, though, but we are starting to see things happening. The VMs also start their Virtual Disk Service and Logical Disk Manager when this occurs. Very strange.

                             

                             

                            UNMAP VOL_DELETE LUN 33

                             

                            Jun 20 11:45:29 cisqhsesx004 vmkernel: 16:18:34:11.513 cpu4:2027)StorageMonitor: 196: vmhba1:3:1:0 status = 2/0 0x6 0x3f 0xe

                            Jun 20 11:45:29 cisqhsesx004 vmkernel: 16:18:34:11.514 cpu4:2027)StorageMonitor: 196: vmhba1:1:0:0 status = 2/0 0x6 0x3f 0xe

                            Jun 20 11:45:29 cisqhsesx004 vmkernel: 16:18:34:11.852 cpu4:2027)StorageMonitor: 196: vmhba1:3:1:0 status = 2/0 0x6 0x29 0x0

                            Jun 20 11:45:30 cisqhsesx004 vmkernel: 16:18:34:12.532 cpu4:2027)StorageMonitor: 196: vmhba1:3:2:0 status = 2/0 0x6 0x29 0x0

                            Jun 20 11:45:30 cisqhsesx004 vmkernel: 16:18:34:12.532 cpu4:2027)StorageMonitor: 196: vmhba1:3:3:0 status = 2/0 0x6 0x29 0x0

                            Jun 20 11:45:33 cisqhsesx004 vmkernel: 16:18:34:15.493 cpu7:2140)StorageMonitor: 196: vmhba1:3:4:0 status = 2/0 0x6 0x3f 0xe

                            Jun 20 11:45:33 cisqhsesx004 vmkernel: 16:18:34:15.493 cpu7:2140)StorageMonitor: 196: vmhba1:3:4:0 status = 2/0 0x6 0x29 0x0

                            Jun 20 11:46:08 cisqhsesx004 vmkernel: 16:18:34:51.156 cpu9:2146)StorageMonitor: 196: vmhba1:3:14:0 status = 2/0 0x6 0x29 0x0

                            Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.327 cpu9:1055)SCSI: 2020: Marking path vmhba1:0:33 as dead

                            Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.328 cpu9:1055)SCSI: 2020: Marking path vmhba1:2:33 as dead

                            Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.329 cpu9:1055)SCSI: 2020: Marking path vmhba2:1:33 as dead

                            Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.329 cpu9:1055)SCSI: 2020: Marking path vmhba2:2:33 as dead
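
                            For reading lines like these, a small decoder can help. This is a sketch under an assumption: that the three hex values after `status = 2/0` are the SCSI sense key, ASC, and ASCQ, which is how these StorageMonitor lines are commonly read. The text for the two code pairs comes from the standard SCSI sense code assignments. The function name `decode` and the lookup tables are made up for this example and only cover the codes seen in the log above.

```python
import re

# Matches the StorageMonitor lines above; field meanings are an assumption.
LINE_RE = re.compile(
    r"StorageMonitor: \d+: (?P<path>vmhba\d+:\d+:\d+:\d+) status = "
    r"(?P<dev>\d+)/(?P<host>\d+) "
    r"(?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)"
)

# Only the values that appear in the log excerpt are listed here.
SENSE_KEYS = {0x6: "UNIT ATTENTION"}
ASC_ASCQ = {
    (0x3F, 0x0E): "REPORTED LUNS DATA HAS CHANGED",
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
}


def decode(line: str):
    """Return (path, sense key text, ASC/ASCQ text), or None if the line doesn't match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    key = int(m["key"], 16)
    asc = int(m["asc"], 16)
    ascq = int(m["ascq"], 16)
    return (
        m["path"],
        SENSE_KEYS.get(key, hex(key)),
        ASC_ASCQ.get((asc, ascq), "unknown"),
    )


sample = ("Jun 20 11:45:29 cisqhsesx004 vmkernel: 16:18:34:11.513 cpu4:2027)"
          "StorageMonitor: 196: vmhba1:3:1:0 status = 2/0 0x6 0x3f 0xe")
```

                            Read this way, the `0x6 0x3f 0xe` lines are unit attentions saying the reported LUN list changed, which would fit with unmapping a LUN, and the `0x6 0x29 0x0` lines report a reset. Neither says anything about a partition table being written.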

                             

                             

                            • 11. Re: Lost Partition Table of RDM
                              Teovmy Enthusiast

                              > Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.327 cpu9:1055)SCSI: 2020: Marking path vmhba1:0:33 as dead

                              > Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.328 cpu9:1055)SCSI: 2020: Marking path vmhba1:2:33 as dead

                              > Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.329 cpu9:1055)SCSI: 2020: Marking path vmhba2:1:33 as dead

                              > Jun 20 11:46:21 cisqhsesx004 vmkernel: 16:18:35:03.329 cpu9:1055)SCSI: 2020: Marking path vmhba2:2:33 as dead

                               

                              I believe we had the same errors as shown above, but when I checked, the paths were alive. Again, I cannot see what happened at that time.

                              We have an EMC CLARiiON CX500.

                              • 12. Re: Lost Partition Table of RDM
                                LeeCarey Novice

                                Were you doing anything on the storage array at the time these went down, such as presenting or unpresenting LUNs, or rescanning the storage? I managed to re-create it once more today, but not after that. The difference was that we had another ESX cluster sharing the same storage with the current cluster. The steps were: create a VM, add some storage, shut down the newly powered-on cluster, unmap an unused LUN, and instantly 5 VMs lose their MBR. VMware says this should not cause any issue, as ESX controls access to the LUNs, not the VirtualCenter clustering. Determined to fix this.

                                • 13. Re: Lost Partition Table of RDM
                                  Teovmy Enthusiast

                                   

                                  I was rescanning the storage at that time, and the HBA was probably defective! But even after replacing the HBA, this still happened after rescanning. After that I replaced the mainboard of the server and fitted another new HBA.

                                  After that it never happened again. So it seems the server was defective. VMware also said this could not be the problem.

                                   

                                   

                                  • 14. Re: Lost Partition Table of RDM
                                    LeeCarey Novice

                                     

                                    Thanks Teovmy. Did this only start happening after moving to 3.5 on your blades, and are you using Virtual Connect? Out of interest, how many hosts are in the cluster, and how many VMs? We just can't get the thing to replicate now and are resorting to hitting it with a stick.

                                    Cheers

                                    L

                                     

                                     
