I have an EqualLogic PS100E array with dual controllers running firmware 3.2.4. The controllers are connected to two Cisco 3560G switches in a mesh configuration (2 ports from each controller go to a separate switch), and all six ESXi 4.0 hosts are connected to those switches in a redundant mesh config as well (4 NICs each: 2 network and 2 iSCSI). VLANs separate iSCSI from regular network traffic. From that perspective everything is "by the book" in terms of redundancy: I can lose a NIC, cable, port, switch or a controller and everything will keep running.
Can I upgrade the controller firmware to 4.3.6 (support suggested this path: 3.2.4 > 4.0.7 > 4.1.7 > 4.3.6) without downtime? Support is not very clear, basically a "yes it works but..." kind of answer. I need to know whether I must shut everything down, or whether the upgrade moves traffic and connections from one controller over to the other seamlessly while VMs are running. The storage has 2x 1TB volumes, both VMFS, and both contain multiple VMs. No VMs have direct access to the storage; we are not using any iSCSI initiators within VMs. Only the ESXi hosts have access to the iSCSI volumes.
Does anyone have experience doing this in a live environment? Any performance issues during the upgrade? Any tips?
Have a good read of the release notes for each firmware version; you are looking for the details on non-disruptive updates, which I believe were introduced in 4.0.
Therefore 3.2.4 > 4.0.7 is going to incur some downtime, but 4.0.7 > 4.1.7 > 4.3.6 and onwards will not.
You can update without taking a complete outage:
3.2.4 > latest 3.x (3.3.2 Patch 2 or higher; the latest 3.x is best). You need to be on this version so that when you move to 4.x you don't incur an outage while the group's internal data structures are converted to the 4.x format.
Latest 3.x > 4.0.x
Latest 4.0.x > Latest 4.1
Latest 4.1 > 4.3.6
Array restarts on 3.x firmware take anywhere between 50 and 70 seconds; on 4.x they take about 30 seconds. To survive these delays you will need to set your VMs' disk TimeoutValue to between 60 and 120 seconds; I've got mine at 120.
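For reference, a sketch of how that disk timeout is typically raised inside the guests (the device name `sda` and the value 120 are just examples; adjust to your environment):

```shell
# Windows guest: raise the SCSI disk timeout to 120 seconds via the
# Disk\TimeoutValue registry setting (takes effect without a reboot
# for new I/O, but a reboot is the safe assumption)
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeoutValue /t REG_DWORD /d 120 /f

# Linux guest: set the per-device SCSI command timeout. This does not
# persist across reboots -- add a udev rule or init script to make it stick.
echo 120 > /sys/block/sda/device/timeout
```
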
A good way to test this is to take an ESX host with a single VM on it and change the VLAN on the switch port(s) that its iSCSI NICs/HBAs are connected to, moving them to some unused VLAN. This simulates the window where you lose access to the array during a restart. Testing for 50-60 seconds should give you a safe margin.
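On the Cisco 3560Gs that test is a simple VLAN swap; the interface and VLAN numbers below are placeholders, not your actual config:

```
! Move the iSCSI-facing port to an unused VLAN to simulate losing the array
interface GigabitEthernet0/10
 switchport access vlan 999
!
! ...wait 50-60 seconds, then move it back to the real iSCSI VLAN
interface GigabitEthernet0/10
 switchport access vlan 100
```
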