Most of the strange behaviours in SANs and networks I've seen had to do with faulty FC GBICs or NICs. In those cases none of the monitoring tools showed an error; we simply swapped FC cables and FC ports until we identified the damaged one.
Have you tried monitoring the network and SAN ports during a performance breakdown?
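If the switch isn't logging continuously, a small script that polls the port error counters and reports any movement can do the watching for you. This is only a rough sketch, not a tested tool: it assumes passwordless SSH to a hypothetical switch at admin@fcswitch1 and the usual columnar porterrshow output, so adjust for your environment.

```python
#!/usr/bin/env python
"""Poll a Brocade switch for port error counter movement.

Sketch only: assumes passwordless SSH as 'admin@fcswitch1'
(hypothetical address) and the usual columnar 'porterrshow' output.
"""
import subprocess
import time

SWITCH = "admin@fcswitch1"   # hypothetical switch address
INTERVAL = 60                # seconds between polls

def snapshot():
    """Return {port: list of raw counter fields} from porterrshow."""
    out = subprocess.check_output(["ssh", SWITCH, "porterrshow"])
    counters = {}
    for line in out.decode().splitlines():
        fields = line.split()
        # Data lines start with the port number followed by a colon;
        # header lines ('frames', 'crc', ...) get skipped here.
        if fields and fields[0].rstrip(":").isdigit():
            counters[fields[0].rstrip(":")] = fields[1:]
    return counters

prev = snapshot()
while True:
    time.sleep(INTERVAL)
    cur = snapshot()
    for port, vals in cur.items():
        if port in prev and vals != prev[port]:
            # Any counter movement (crc_err, enc_out, etc.) is worth a look.
            print("port %s changed: %s -> %s" % (port, prev[port], vals))
    prev = cur
```

Leave it running on a management box; when the slowdown hits, any counter that moved points you at the suspect port, cable, or GBIC.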
We see no errors during the down times, but they are so erratic it's hard to be in place and ready. The SAN doesn't do constant logging, and no alarms or alerts are raised.
If we can get into our new home (DMX3000) fairly soon, I hope the problems disappear. But for the time being I can't risk the project, so I'm having to hunt down an erratic problem that doesn't last very long and triggers no alarms.
We had something similar, although based around iSCSI on the IBM Blade. Flash absolutely everything to the latest firmware: disks, RAID adapters, NICs, the system itself, etc.
Did that. Everything is current. We did the internal Brocades in the chassis, the external Brocades that are the cores, etc.
We had a very similar problem very recently and discovered that it was a "path thrashing" issue. Are all of your ESX hosts set up to use the "Most Recently Used" failover policy on their FC HBAs? Can you duplicate the problem by transferring a LUN from one storage controller to the other, simulating a failover?
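If you want to check quickly, the policy is visible from the service console. Here's a rough sketch that flags any LUN not set to MRU; it assumes the classic esxcfg-mpath -l summary format (which I'm quoting from memory), so verify the output on your own hosts first.

```python
#!/usr/bin/env python
"""Flag LUNs whose failover policy is not MRU on an ESX 3.x host.

Audit sketch, not a supported tool: assumes the classic
'esxcfg-mpath -l' output, where each LUN summary line reads
'Disk vmhba1:0:1 ... has N paths and policy of <policy name>'.
Run it on the service console.
"""
import subprocess

out = subprocess.check_output(["esxcfg-mpath", "-l"]).decode()
for line in out.splitlines():
    # Only the per-LUN summary lines carry the policy.
    if line.startswith("Disk") and "policy of" in line:
        policy = line.split("policy of", 1)[1].strip()
        if policy != "Most Recently Used":
            lun = line.split()[1]  # e.g. 'vmhba1:0:1'
            print("%s is set to '%s' -- consider MRU" % (lun, policy))
```

Run it on each host; on an active-passive array, any LUN left on Fixed with hosts preferring different paths is exactly the setup that lets the LUN bounce between controllers.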
I'd be very careful with that last test; if path thrashing really is the issue, getting things back to normal can be an excruciating process.
We're running mostly IBM LS20 blades against DS4300 & DS4800 storage arrays, with Brocade FC switches across the board.
We actually switched paths recently so we could do firmware updates on the Brocades. The problem is annoying, as there's no rhyme or reason to it. We're applying several patches this coming week.