VMware Cloud Community
ArrowSIVAC
Enthusiast

SCSI Emulation on SAN

I believe I know the root cause, but I want to see if there is someone who can validate it, and also help me learn how to 'root cause' these issues faster on 3i in the future. The details are a bit long; I apologize, but I believe they provide the necessary information.

Background:

I have a lab with a number of servers. I started with 3.5U2, then got a new toy and installed it with 3i. I have 70+ virtual machines on my SAN. The SAN is an IBM DS4300 single-controller unit with 16x 300GB Fibre drives. I decided to break the solution into two arrays, (7+P, Array1, VMFS0) and (6+P, Array2, VMFS1), plus one hot spare. Also important: all servers in my lab are SAN boot; I have twenty 3GB boot LUNs sliced off the first array as VMware ESX boot targets.

The issue I ran into was that the eight spindles were being crushed to the point that the VMs were timing out. I decided the only fix was to drop (6+P, Array2, VMFS1) and make Array1/VMFS0 a (14+P) array to get more aggregate spindles working for the VMs. I discovered that I could not expand Array1 by more than two spindles, so I figured I would upgrade the SAN controller from firmware 5.34.10.00 (the last supported firmware on the IBM single-controller DS4300) to 06.60.17.00, which is supposed to be for the dual-controller versions; I would just ignore the "missing controller" error if it gave me the ability to expand the array. I completed the upgrade, and all my VMware servers booted fine. I expanded the array and again verified that the ESX servers booted fine with the new (14+P) array.

All seemed to go well... but...

I went to start my virtual machine pool back up. I saw that the VMFS0 volume was still there, but all my virtual machines were greyed out as if they were offline or inaccessible. I went to the Fibre HBA on a few of the servers and (as they were SAN booting fine) figured my SAN zoning and partitioning were OK. I scanned the HBAs for LUNs, and they saw all the normal LUNs, including LUN 20, which is my VMFS0 1.5TB volume. I went into storage and saw that VMFS0 was still there, but when I right-clicked and browsed, all I saw were "vpxa.log" and other versions of it (e.g. vpxa-0.log, vpxa-1.log, etc.). I ssh'd into the box and saw:

root@8877cle2 volumes# pwd
/vmfs/volumes
root@8877cle2 volumes# ls -alh
total 0
drwxr-xr-x 1 root root 512 Nov 22 17:23 .
drwxrwxrwt 1 root root 512 Nov 22 08:04 ..
root@8877cle2 volumes#

As such, I was a bit perplexed about where my volume had gone.

root@8877cle2 volumes# tail /var/log/messages

... was not a huge help.

I looked at dmesg and found only this odd message about my VMFS0 volume's LUN:

SCSI device sdv: 3383503093 512-byte hdwr sectors (1733327 MB)
sdv: sdv1
SCSI device sdw: 40960 512-byte hdwr sectors (21 MB)
sdw: unknown partition table
Vendor: IBM Model: Universal Xport Rev: 0617
Type: Direct-Access ANSI SCSI revision: 05
VMWARE SCSI Id: Supported VPD pages for sdx : 0x0 0x80 0x83 0x85 0xc0 0xc1 0xc2 0xc3 0xc4 0xc5 0xc7 0xc8 0xc9 0xca 0xd0
VMWARE SCSI Id: Device id info for sdx: 0x1 0x3 0x0 0x10 0x60 0xa 0xb 0x80 0x0 0x39 0xbf 0xa7 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x1 0x93 0x0 0x8 0x20 0x24 0x0 0xa0 0xb8 0x38 0x9a 0x97 0x1 0x94 0x0 0x4 0x0 0x0 0x0 0x1 0x1 0xa3 0x0 0x8 0x20 0x4 0x0 0xa0 0xb8 0x38 0x9a 0x97
VMWARE SCSI Id: Id for sdx 0x60 0x0a 0x0b 0x80 0x00 0x39 0xbf 0xa7 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x55 0x6e 0x69 0x76 0x65 0x72
Disk sdx is a pseudo device. lid = 31, ro = 0, cap: (512 * 40960) = 20971520
VMWARE: Unique Device attached as scsi disk sdx at scsi2, channel 0, id 0, lun 31
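As an aside, the "Device id info" line above is the raw content of SCSI VPD page 0x83 (Device Identification). Assuming the SPC-3 descriptor layout (a 4-byte header followed by the designator), a short sketch can decode it; the first descriptor should match the NAA id that the "Id for sdx" line prints. The helper name is mine, not anything VMware ships:

```python
# Sketch: decode the "Device id info" bytes from the dmesg output above as
# SPC-3 VPD page 0x83 designation descriptors. Each descriptor is a 4-byte
# header (code set, association/type, reserved, length) plus the designator.

ASSOCIATION = {0: "LUN", 1: "target port", 2: "target device"}
DESIGNATOR_TYPE = {2: "EUI-64", 3: "NAA", 4: "relative target port", 8: "SCSI name string"}

def parse_vpd83(data):
    """Return a list of (association, designator type, hex designator) tuples."""
    out, i = [], 0
    while i + 4 <= len(data):
        assoc = (data[i + 1] >> 4) & 0x3          # bits 5-4 of byte 1
        dtype = data[i + 1] & 0xF                 # bits 3-0 of byte 1
        length = data[i + 3]                      # designator length
        designator = data[i + 4 : i + 4 + length]
        out.append((ASSOCIATION.get(assoc, "?"),
                    DESIGNATOR_TYPE.get(dtype, "?"),
                    designator.hex()))
        i += 4 + length
    return out

# The raw bytes printed for sdx in the dmesg output:
sdx_vpd = bytes([
    0x01, 0x03, 0x00, 0x10,
    0x60, 0x0A, 0x0B, 0x80, 0x00, 0x39, 0xBF, 0xA7,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x01, 0x93, 0x00, 0x08,
    0x20, 0x24, 0x00, 0xA0, 0xB8, 0x38, 0x9A, 0x97,
    0x01, 0x94, 0x00, 0x04, 0x00, 0x00, 0x00, 0x01,
    0x01, 0xA3, 0x00, 0x08,
    0x20, 0x04, 0x00, 0xA0, 0xB8, 0x38, 0x9A, 0x97,
])

for assoc, dtype, ident in parse_vpd83(sdx_vpd):
    print(f"{assoc:13s} {dtype:21s} {ident}")
```

The first (LUN-associated) NAA designator, 600a0b800039bfa7..., is the unique volume id that ESX keys on, which is why host-type changes that alter what the controller presents are so disruptive.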

Everything else looked fine... At that point I was swearing at my SAN :) Those 70 VMs represent about a month and a half of my life spent building demo systems!!!

So... after a cup of coffee...

I started back with the physical layer. QLogic saw all the LUNs OK, the SAN zoning looked fine, and on the controller all the WWNs looked properly partitioned. I then removed one of the servers entirely and rebuilt it from scratch. At that point I noticed that the only host-type option for the DS4300 was "Linux", not the "Linux Cluster" I used to set it to... BINGO! I then realized that ALL the SAN HBAs in my VMware farm were now set to AIX!

I reset all my hosts to Linux and rebooted, but got the same issue. As I suspected, that would not fix it: the SCSI reservation behavior of host type "Linux Cluster" is required for ESX to properly tag and mount a VMFS volume.

After reading a lot about the DS4xxx product line, I learned that IBM later added host kits for Linux that they charge for. They moved "Linux Cluster" into the new "VMware" host kit, which now costs money. So when I upgraded my DS4300, I lost the option to set the host type to "Linux Cluster"; all the HBAs then had invalid host types set and were reverted to the default top option of the host-type list... AIX.
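To make the reservation theory concrete, here is a toy model (my own sketch, not VMware or IBM code) of why a host whose SCSI-2 RESERVE commands the controller rejects can never take the on-disk lock ESX needs before it will mount a VMFS volume. The host-type gate below is a stand-in for the DS4300 NVSRAM behavior described above:

```python
# Toy model: a LUN that only honors SCSI-2 RESERVE for certain host types,
# standing in for the DS4300's NVSRAM host-type definitions. Not real
# ESX/controller code -- just the locking logic the post describes.

class ReservationConflict(Exception):
    pass

class Lun:
    # Host types whose RESERVE this hypothetical controller honors.
    RESERVE_CAPABLE = {"Linux Cluster", "LNXCLVMWARE"}

    def __init__(self):
        self.reserved_by = None

    def reserve(self, host, host_type):
        if host_type not in self.RESERVE_CAPABLE:
            # Wrong host type: the controller refuses the reservation outright.
            raise ReservationConflict(f"{host}: RESERVE rejected for host type {host_type}")
        if self.reserved_by not in (None, host):
            raise ReservationConflict(f"{host}: LUN already reserved by {self.reserved_by}")
        self.reserved_by = host

    def release(self, host):
        if self.reserved_by == host:
            self.reserved_by = None

def mount_vmfs(lun, host, host_type):
    """ESX-style mount step: take the reservation, touch metadata, release."""
    lun.reserve(host, host_type)   # failure here -> volume never mounts
    try:
        return "mounted"
    finally:
        lun.release(host)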

*************

Question:

1) How can I validate what I believe to be the case above concerning SCSI reservation locking?

2) I tried other host types, such as "Windows NT 4.0 Cluster", "Windows 2003 Cluster", etc., but none worked. Can someone explain in a bit of detail what SCSI tagging VMware is looking for before it will mount a VMFS volume?

3) The only way I was able to debug this was through the VMware 3.5 console. I have one 3i host that I was just bringing up before this all happened, and I can't imagine how someone could debug / root cause an issue like this without a shell. The only message in the VMware logs on the 3.5 system was that the VM systems could not be located. On the 3i server the message was:

11/22/2008 7:58:50 am, Issue detected on 8878cle1 in aesscle IBM: LVM: 4476: vml.020014........ (1:14:00:25:092 cpu8:440588)

Is there a VMFS flow diagram of how the mount and validation process is done, so we can key in on these issues faster? The above took me all day to work through. I can't imagine being in a production environment and losing 70 production systems to a simple upgrade. Even if this is "an issue with IBM's SAN", that does not help IT people show the SAN group that their VMware servers are not the issue and that the storage controller is.

PS: The inability to highlight and copy text from the VMware client log is very irritating.

3 Replies
ArrowSIVAC
Enthusiast

I discovered the root cause of the issue.

It was, as I suspected, a SCSI emulation issue that was stopping VMware from mounting the volume, but the underlying root cause was an NVSRAM flash issue on the DS4300.

I discovered the issue by moving the physical drives of the array to a second controller. All the node definitions then popped their host type back to "LNXCLVMWARE" from "AIX".

****************

Broken DS4300

Firmware version: 06.60.17.00

NVSRAM version: N2880-540800-004

Management software version: 09.60.G5.41

NVSRAM HOST TYPE DEFINITIONS

NOTE: The following indexes are not used: 13 - 15

HOST TYPE                                            ADT STATUS   ASSOCIATED INDEX
AIX                                                  Disabled     4
HP-UX                                                Enabled      3
Irix                                                 Disabled     5
Linux                                                Enabled      6 (Default)
Netware Failover                                     Enabled      11
Netware Non-Failover                                 Enabled      9
PTX                                                  Enabled      10
Solaris                                              Disabled     2
Solaris (with Veritas DMP)                           Enabled      12
Windows 2000/Server 2003/Server 2008 Clustered       Disabled     8
Windows 2000/Server 2003/Server 2008 Non-Clustered   Disabled     1
Windows NT Clustered (SP5 or higher)                 Disabled     7
Windows NT Non-Clustered (SP5 or higher)             Disabled     0

****************

****************

Working DS4300

Current configuration

Firmware version: 06.60.17.00

NVSRAM version: N1722F60R960V0AF

Management software version: 09.60.G5.41

NVSRAM HOST TYPE DEFINITIONS

HOST TYPE                                                          ADT STATUS   ASSOCIATED INDEX
AIX                                                                Disabled     6
AIX-ADT/AVT                                                        Enabled      4
DEFAULT                                                            Disabled     0 (Default)
HP-UX                                                              Enabled      7
IBM TS SAN VCE                                                     Enabled      12
Irix                                                               Disabled     10
LNXCLVMWARE                                                        Disabled     13
Linux                                                              Enabled      5
Netware Failover                                                   Enabled      11
Solaris                                                            Disabled     8
Solaris (with Veritas DMP)                                         Enabled      14
Unused1                                                            Disabled     1
Windows 2000/Server 2003/Server 2008 Clustered                     Disabled     3
Windows 2000/Server 2003/Server 2008 Clustered (supports DMP)      Enabled      15
Windows 2000/Server 2003/Server 2008 Non-Clustered                 Disabled     2
Windows 2000/Server 2003/Server 2008 Non-Clustered (supports DMP)  Enabled      9
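A quick set diff of the host-type names in the two NVSRAM dumps makes the regression obvious (names copied verbatim from the listings above):

```python
# Diff the host-type names between the broken and working NVSRAM dumps.
broken = {
    "AIX", "HP-UX", "Irix", "Linux", "Netware Failover",
    "Netware Non-Failover", "PTX", "Solaris", "Solaris (with Veritas DMP)",
    "Windows 2000/Server 2003/Server 2008 Clustered",
    "Windows 2000/Server 2003/Server 2008 Non-Clustered",
    "Windows NT Clustered (SP5 or higher)",
    "Windows NT Non-Clustered (SP5 or higher)",
}
working = {
    "AIX", "AIX-ADT/AVT", "DEFAULT", "HP-UX", "IBM TS SAN VCE", "Irix",
    "LNXCLVMWARE", "Linux", "Netware Failover", "Solaris",
    "Solaris (with Veritas DMP)", "Unused1",
    "Windows 2000/Server 2003/Server 2008 Clustered",
    "Windows 2000/Server 2003/Server 2008 Clustered (supports DMP)",
    "Windows 2000/Server 2003/Server 2008 Non-Clustered",
    "Windows 2000/Server 2003/Server 2008 Non-Clustered (supports DMP)",
}

print("only on working controller:", sorted(working - broken))
print("only on broken controller: ", sorted(broken - working))
```

LNXCLVMWARE only exists in the working controller's NVSRAM image, which is exactly why every HBA definition on the broken controller fell back to an invalid host type.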

Lesson learned: don't trust an upgrade until you verify the version actually changed.

esdsv
Contributor

We need to connect two VMware ESX 3.5 hosts to a DS4300 dual controller with firmware 6.xxxxx. We don't have the VMware host kit, and it seems that IBM doesn't sell it anymore.

Do you know if this clustered configuration with two ESX hosts works, i.e. to form a cluster with HA, DRS, and VMotion?

What host types and settings did you set on your DS4300?

Thanks in advance.

SeeSite
Contributor

When you want to use VMware, the host type is LNXCL (in the newer firmware versions you can see it renamed LNXCLVMWARE).

Have Fun!

HINT: When you use Raw Device Mapping, you have to choose the host type of the VM's OS for that LUN!

IT Senior Consultant

CEMA AG Spezialisten für Informationstechnologie

Alter Wandrahm 15

20457 Hamburg
