Hi everybody,
One of my clients has recently had a few issues with an HP MSA2012i. One of the disks in the array died (RAID 6, so not an issue in itself) and was replaced with a new one. The array started to rebuild the disk as it should. However, for some unknown reason, the array then took itself offline. The client brought the array back up and all was well, both from a usability and a rebuild point of view.
After rebuilding the disk, one of the volumes disappeared from the VMware ESX servers (two ESX servers running 3.5 Update 2). The volume (LUN 1), which was called DRSANSATA1, exists on the same physical RAID set as another LUN (LUN 0), called DRSANSATA0.
From the VMware console (and CLI), DRSANSATA0 can be seen and used as normal. DRSANSATA1 is missing.
There are now no issues with the SAN RAID set. Host Mapping on the SAN is correct.
Within Storage Adapters, LUN 1 can be seen and all the paths are correct. Performing a Rescan does nothing.
Within Storage, if you select 'Add Storage', you can see LUN 1, and the 'Current Disk Layout' even reports that it is 'VMFS' formatted. Again, performing a Refresh does nothing.
Any ideas on how to re-associate this volume with ESX without blowing it away?
Many thanks in advance,
Daniel
Not really familiar with the MSA2000 series. You may want to check this out though
Is it possible it got a new SCSI VPID and is therefore being seen as a snapshot?
--Matt
For the LUN ID to change wouldn't there need to be a controller type mismatch or something similar (Not really a super storage guy)? This is a new piece of hardware so maybe there is an issue that HP is not aware of or has not fessed up to.
I think you might want to call HP before setting EnableResignature and trying to bring the Volume back online
That entirely depends on the storage array, and some are better than others. The MSAs are well known to be in the 'others' group
--Matt
Hi,
Yes, it's possible to bring it back, provided it is not badly corrupted.
Does the host that is still running see that storage and properly access it?
What does esxcfg-mpath -l report on the host with the issue?
Also check the vmkwarning log, e.g.:
cat /var/log/vmkwarning
Hmm, that's exactly what I thought after replacing a controller in another MSA2000 two weeks ago and finding that the LUN ID had changed.
However, I've already tried the workaround of setting LVM.DisallowSnapshotLUN to 0. Still no joy.
Thank you for your response. The paths all seem to be correct, as well as the preferred path.
I had checked the logs earlier and nothing was being reported. However, I have just rebooted one of the servers and now we have something to work on. The issue is pretty much what I thought it was, and what I was hoping wouldn't happen. Here's the message:
Jan 15 16:40:23 mhgsvesx00789 vmkernel: 0:00:01:16.196 cpu3:1040)WARNING: Vol3: 611: Couldn't read volume header from 4907489e-a8e1739e-a009-00215aaa8368: Address temporarily unmapped
Any ideas on how to rescue a VMFS header?
Many thanks,
Daniel
Now we all know HP equipment is the best! Shame that this is a rebadged Dot Hill product then.
Yes. But I'm not sure it will help you.
Here's how it was done.
http://communities.vmware.com/message/1033534
Can you post the fdisk -lu output from one of the hosts?
Also run the command vmkfstools -V and post the last 4-5 lines of /var/log/vmkernel.
-Surya
Hi Surya
Here are the outputs you requested:
fdisk -lu
Disk /dev/sda: 2000.3 GB, 2000375775232 bytes
255 heads, 63 sectors/track, 243198 cylinders, total 3906983936 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1             128  -387991427  1953487871   fb  Unknown

Disk /dev/sdb: 2000.3 GB, 2000375775232 bytes
255 heads, 63 sectors/track, 243198 cylinders, total 3906983936 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1             128  -387991427  1953487871   fb  Unknown

Disk /dev/cciss/c0d0: 73.3 GB, 73372631040 bytes
255 heads, 63 sectors/track, 8920 cylinders, total 143305920 sectors
Units = sectors of 1 * 512 = 512 bytes

           Device Boot      Start        End     Blocks   Id  System
/dev/cciss/c0d0p1   *          63     208844     104391   83  Linux
/dev/cciss/c0d0p2          208845   20691719   10241437+  83  Linux
/dev/cciss/c0d0p3        20691720   23968979    1638630   82  Linux swap
/dev/cciss/c0d0p4        23968980  143299799   59665410    f  Win95 Ext'd (LBA)
/dev/cciss/c0d0p5        23969043   32162129    4096543+  83  Linux
/dev/cciss/c0d0p6        32162193   40355279    4096543+  83  Linux
/dev/cciss/c0d0p7        40355343   48548429    4096543+  83  Linux
/dev/cciss/c0d0p8        48548493  143090954   47271231   fb  Unknown
/dev/cciss/c0d0p9       143091018  143299799     104391   fc  Unknown

/dev/sdb is the affected LUN.
As for doing a Rescan, the only messages logged are as follows:
Jan 15 23:12:10 mhgsvesx00788 vmkernel: 0:03:54:51.775 cpu2:1041)WARNING: Res3: 1053: resource 4 (cluster 9) already freed by another host: This may be a non-issue
Jan 16 01:08:13 mhgsvesx00788 vmkernel: 0:05:50:54.548 cpu3:1041)WARNING: Vol3: 611: Couldn't read volume header from 4907489e-a8e1739e-a009-00215aaa8368: Address temporarily unmapped
Daniel
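When the log is busy, it can help to pull out just the volume UUIDs that the kernel failed to read. The following is a minimal sketch, not something taken from VMware documentation: the function name is hypothetical, and it assumes the warning text has the same shape as the lines quoted above.

```shell
# Hypothetical helper: list the unique volume UUIDs named in
# "Couldn't read volume header from <uuid>: ..." warnings.
# $1 is the log file; the default is an assumption based on the path
# mentioned earlier in this thread.
failing_volume_uuids() {
    log_file="${1:-/var/log/vmkwarning}"
    # Keep only lines containing the message, print the UUID that follows
    # "volume header from" (everything up to the next colon), then de-duplicate.
    sed -n 's/.*volume header from \([^:]*\):.*/\1/p' "$log_file" | sort -u
}
```

Run against a copy of the log, for example `failing_volume_uuids /var/log/vmkernel`, it should print one line per affected volume.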
Yep, that's exactly what I was worried about.
Not sure whether to fix this partition or blow it away. It only holds replicated VMs anyway, so it's not important. However, I might have a little fun first!
Would it be possible for you to send a dd dump of both sda and sdb? I need the first 100K.
dd if=/dev/sd<X> of=/tmp/sd<X>.out bs=1024 count=100
If you are not comfortable posting the dump here you can PM me.
-Surya
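Since the partitions in the fdisk output start at sector 128 (65,536 bytes in), a 100K dump covers the partition table plus the first 36K of the partition. What will be examined in the dump is not stated here, so the following is only a quick sanity check one could run before sending it: it confirms that the first sector ends with the standard 55 aa boot signature. The function name is hypothetical.

```shell
# Hypothetical helper: succeed if the first sector of a dump file ends in
# the 55 aa signature that marks a DOS-style partition table.
has_mbr_signature() {
    # Read the 2 bytes at offset 510 as hex, then strip od's spacing.
    sig=$(od -An -tx1 -j510 -N2 "$1" | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

For example, `has_mbr_signature /tmp/sdb.out && echo "partition table sector present"`. A failure here would suggest the damage starts at the very beginning of the LUN rather than inside the partition.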
Device Boot Start End Blocks Id System
/dev/sda1 128 -387991427 1953487871 fb Unknown
Device Boot Start End Blocks Id System
/dev/sdb1 128 -387991427 1953487871 fb Unknown
This does not look good; you should not have a negative value for the ending sector.
Is the volume backed by /dev/sda1 functioning correctly?
Yes, this device is working perfectly (DRSANSATA0).
Disk /dev/sdd: 2199.0 GB, 2199023190016 bytes
255 heads, 63 sectors/track, 267349 cylinders, total 4294967168 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/sdd1 128 -5612 2147480778+ fb Unknown
I just created a 2TB LUN at home and it appears that if the raw device is near 2TB the partition end block value is displayed as a negative integer.
So that does not look like a real issue.
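The negative numbers are consistent with the end sector simply overflowing a signed 32-bit field in the display. As a rough check (a sketch with hypothetical function names, assuming a shell with 64-bit arithmetic such as bash), reinterpreting the value as unsigned gives exactly the end sector you would expect from the Start and Blocks columns:

```shell
# Reinterpret a signed 32-bit value as unsigned.
to_unsigned32() {
    echo $(( $1 & 0xFFFFFFFF ))
}

# Expected end sector from fdisk's Start and Blocks columns.
# Blocks are 1 KiB, i.e. 2 sectors each. Pass 1 as the third argument
# when the Blocks value has a trailing "+" (one extra odd sector).
expected_end() {
    echo $(( $1 + $2 * 2 + ${3:-0} - 1 ))
}

to_unsigned32 -387991427        # /dev/sda1 and /dev/sdb1: 3906975869
expected_end 128 1953487871     # 3906975869

to_unsigned32 -5612             # /dev/sdd1: 4294961684
expected_end 128 2147480778 1   # 4294961684
```

Both pairs agree, and both results are below the total sector count reported for their disks, so the partition table entries themselves look sane.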
Mike,
It's not a negative value; fdisk just cannot show the complete value and puts a "-" in between the start and end blocks.
Try sfdisk -luB; this should show you the correct value.
-Surya
Sorry, wrong post. Removing it.
-Surya