VMware Cloud Community
dlee2654
Enthusiast

upgrade from ESX 3.5 to 4.0u2 takes the SAN down!

Hello, I just want to start by saying that for the most part Update Manager has been flawless in upgrading my hosts from 3.5 to 4.0 Update 2. But in one of my remote locations I ran an update on a host, and when the host came up it took all the LUNs offline for the entire cluster?! I have opened a ticket with support, but so far they don't see anything wrong.
These are Dell PowerEdge 2950 servers (dual-core processors with virtualization technology, but no eXecute Disable option in the BIOS, so EVC is not supported),
QLogic QLE2462 HBAs,
Brocade SW200E switches, WWN zoning, redundant fabrics,
NetApp 3020C with Data ONTAP 7.2.2.
Symptoms so far: the server hangs for 15 minutes or so and then finally continues to boot. There are no LUNs listed in the Storage Adapters section, and very shortly thereafter I got reports that all the VMs were down for the entire farm!!! I consoled into the other two ESX 3.5 hosts and, indeed, the SAN storage was not available when running vdf -h.
I ended up having to reboot the other two hosts after taking the newly updated host offline.
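For what it's worth, the checks I'm referring to on the two surviving 3.5 hosts were along these lines from the service console (the vmhba numbers are just examples from my hardware):

  vdf -h                  # list VMFS datastores; the SAN volumes were simply gone
  esxcfg-mpath -l         # list the FC paths per LUN to see whether they show as dead
  esxcfg-rescan vmhba1    # rescan each FC HBA once the upgraded host was pulled off the fabric
  esxcfg-rescan vmhba2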
If anyone has experienced this sort of thing and found a resolution, please reply. I am now facing having to reload the server back to 3.5.
So right now I have my upgraded host disconnected from the SAN and have sent vm-support logs to VMware; this was the reply. It sounds like they are clueless as to what the problem might be. Thanks in advance...
DO NOT CHANGE THE SUBJECT LINE if you want to respond to this email.
Hello Donald,
I looked at the support logs and I found this:
vmkernel.1:
...
Aug 10 15:50:15 rey-esx-03 vmkernel: QLogic Fiber Channel HBA Driver: 8.02.01-k1-vmw48: vmhba1
Aug 10 15:50:15 rey-esx-03 vmkernel: QLogic QLE2462 - PCI-Express Dual Channel 4Gb Fibre Channel HBA
Aug 10 15:50:15 rey-esx-03 vmkernel: ISP2432: PCIe (2.5Gb/s x4) @ 0000:0c:00.0 hdma+, host#=6, fw=4.04.09 [IP] [M
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:25.246 cpu7:4111)PCI: driver qla2xxx claimed device 0000:0c:00.0
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:25.246 cpu7:4111)LinPCI: LinuxPCI_DeviceClaimed: Device c:0 claimed.
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:25.246 cpu7:4111)PCI: Trying 0000:0c:00.1
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:25.246 cpu7:4111)PCI: Announcing 0000:0c:00.1
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:25.246 cpu7:4111)<6>qla2xxx 0000:0c:00.1: Found an ISP2432, irq 145, iobase 0x0x4100b2e3e000 ...
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:27.687 cpu4:4213)ScsiScan: 844: Path 'vmhba2:C0:T1:L2': Vendor: 'NETAPP  ' Model: 'LUN   ' Rev: '0.2 '
...

Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:23.104 cpu5:4111)VMK_PCI: 638: Device 012:00.0 name: vmhba1
var/log/vmkernel.1:Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:23.363 cpu0:4096)<6>qla2xxx 0000:0c:00.0: LIP reset occured (f700).
var/log/vmkernel.1:Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:23.366 cpu0:4096)<6>qla2xxx 0000:0c:00.0: LIP occured (f700).
var/log/vmkernel.1:Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:23.366 cpu0:4096)<6>qla2xxx 0000:0c:00.0: LIP reset occured (f7f7).
...
var/log/vmkernel.1:Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:02:24.290 cpu7:4109)ScsiDevice: 1904: Failing registration of device 'naa.60a9800043346b712f6f517869496665': failed to acquire legacy uid on path vmhba2:C0:T1:L9: Timeout ...
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:01:44.290 cpu7:4109)VMWARE SCSI Id: Id for vmhba1:C0:T2:L9
Aug 10 15:50:15 rey-esx-03 vmkernel: 0x60 0xa9 0x80 0x00 0x43 0x34 0x6b 0x71 0x2f 0x6f 0x51 0x78 0x69 0x49 0x66 0x65 0x4c 0x55 0x4e 0x20 0x20 0x20
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:02:24.290 cpu3:4204)<6>qla2xxx 0000:0c:00.1: scsi(7:1:9): Abort command issued -- 1 a2b 2002.
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:02:24.290 cpu7:4109)ScsiDevice: 1904: Failing registration of device 'naa.60a9800043346b712f6f517869496665': failed to acquire legacy uid on path vmhba2:C0:T1:L9: Timeout
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:02:24.290 cpu7:4109)WARNING: NMP: nmp_RegisterDevice: Registration of NMP device with primary uid 'naa.60a9800043346b712f6f517869496665' failed. Timeout
Aug 10 15:50:15 rey-esx-03 vmkernel: 0:00:02:34.296 cpu3:4204)<6>qla2xxx 0000:0c:00.1: scsi(7:1:3): Abort command issued -- 1 a44 2002.
...
Note: these messages appear for LUN 1, LUN 2, LUN 3 and LUN 9.
I checked the QLogic HBA on the HCL and it is certified with the QLogic device driver listed in the vmkernel log above (8.02.01-k1-vmw48). We already know the SAN is on the HCL for this version of ESX.
From the ESX host messages alone I cannot tell why the connection to these LUNs times out or why the issue happened.
What you can do is create a separate storage group on the NetApp, create 2 LUNs in it, and add only this host's HBAs to the group, so that only this host sees the 2 LUNs. Then reboot the host and see where it goes.
This way the production LUNs are not affected. If the host boots OK, sees the LUNs properly, and you can create a datastore on those LUNs, that shows it was a transient condition during the in-place upgrade.
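In Data ONTAP 7-mode terms, that isolation test would look roughly like the commands below; the volume name, igroup name and WWPNs are placeholders, not values taken from this environment:

  netapp> vol create esxtest aggr0 40g
  netapp> igroup create -f -t vmware esx03_test 21:00:00:e0:8b:xx:xx:xx 21:01:00:e0:8b:xx:xx:xx
  netapp> lun create -s 10g -t vmware /vol/esxtest/testlun0
  netapp> lun create -s 10g -t vmware /vol/esxtest/testlun1
  netapp> lun map /vol/esxtest/testlun0 esx03_test 0
  netapp> lun map /vol/esxtest/testlun1 esx03_test 1

After a rescan, the upgraded host should see only those two LUNs, and they can be formatted as test datastores without touching production.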
5 Replies
RParker
Immortal

"I have opened a ticket with support but so far they don't see anything wrong."

Something else is wrong. It's not a VMware issue, which is why they haven't found anything. It's not a VM issue because, as you said, your server rebooted. So it's hardware at that point. I suspect you have the QLogic BIOS enabled, and the HBAs are doing something weird to your fibre switch, and that's what is causing the LUNs to go offline.

At any rate, we have 2950s, NetApp, and fibre, and have been at this for almost 4 years with almost 30 ESX hosts. I have NEVER seen this before. This is a hardware issue, or maybe it was just a fluke.

The manual ALSO states that ANY time you perform an upgrade you ALWAYS disconnect from the SAN. So it's a hard lesson to learn, but follow ALL the instructions for ESX upgrades/updates and check the HBA for problems.

This is a hardware issue, not a result of the ESX update. It was a coincidence that it happened then, and a reboot of the ESX host before the update probably would have done the same thing.

dlee2654
Enthusiast

Thank you for your reply, but I had very recently rebooted this server shortly before the migration, and once more the weekend before while doing some unrelated maintenance; rebooting the server under ESX 3.5 never caused this problem in the past. The QLogic BIOS is NOT enabled, and I also have a lot of experience and a very large environment (over 50 hosts since 2004!). Like I said, this is the first time this has happened to me, but so far it appears that this server is now crippled, and to get it back to normal I have to re-install ESX 3.5. I don't have an issue doing this, but I was just hoping someone knew of some HBA / ESX version / ONTAP or other incompatibility that I have not yet discovered. I have never disconnected from the SAN before an in-place upgrade and never had a problem with the process; I can see how disconnecting might prevent a failed upgrade, but I don't see how it relates to the problem I have right now.

Thanks anyway,

-d

dlee2654
Enthusiast

I also wanted to add that the crashing happens after ESX attempts to start, not before the OS loads.

It hangs on these boot messages for over 15 minutes, and during this time it makes the storage unavailable for the other ESX hosts:

  • storage-drivers...

  • Starting Path Claiming and SCSI Device Discovery...

Eventually it boots up, but without the storage, and the rest of the hosts get disconnected.
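Once it does finish booting I can poke at it from the service console; the commands below are the standard ESX 4.0 ones I'd use to show the state (device and adapter names obviously depend on the host):

  esxcfg-scsidevs -a              # list the storage adapters (the two QLogic ports)
  esxcfg-mpath -b                 # brief path listing per device; the NetApp LUNs are missing here too
  esxcli nmp device list          # NMP view of devices; the naa.* devices fail to register
  grep -i "failed to acquire legacy uid" /var/log/vmkernel    # the timeout messages quoted in the support reply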

ProPenguin
Hot Shot

This is why I like iSCSI. Have you checked the zones and everything on the switch lately? Dealing with fibre in the past, most of my issues spawned from the switch in the middle. I would investigate there; some quick checks are sketched below. Hope this helps.
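On a Brocade SW200E, the quick sanity checks from the switch CLI would be something like:

  switchshow        (port states and which WWPNs are logged in)
  zoneshow          (effective zoning config; confirm the host HBA and filer WWPNs share a zone)
  porterrshow       (per-port error counters; CRC/encoding errors point at a bad SFP or cable)
  portshow <port>   (detailed state of a single suspect port)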

dlee2654
Enthusiast

Just an FYI: I had to upgrade ONTAP to resolve this issue. Nothing else would fix it. I upgraded from 7.2.2 to 7.2.7, and after that everything was golden. NetApp support stated that officially 7.2.7 is the earliest tested version of ONTAP known to work with ESX 4.0, but that some earlier versions like 7.2.6.1 or 7.2.5.1 may work as well. Indeed, I have other sites running FC with ESX 4 on 7.2.5.1 with no problems, so I suspect that 7.2.2 is simply too old a version to work with ESX 4.0. Thanks for the suggestions.
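For anyone checking their own filers against this, the running release and the ESX LUN mappings can be confirmed from the ONTAP console; version shows the running Data ONTAP release, lun show -m lists LUN paths with their mapped igroups and LUN IDs, and igroup show lists the initiator groups (ostype is normally vmware for ESX hosts):

  netapp> version
  netapp> lun show -m
  netapp> igroup show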
