VMware Cloud Community
TransennaTCF
Contributor

ESX 3.5 disconnects from iSCSI SAN

The short version of the problem:

I have a serious problem with one of my VI3 platforms: at irregular intervals an ESX server will disconnect an iSCSI port. Very often it will immediately reconnect, but depending on whatever caused the error, this goes on for several seconds or until the ESX disconnects the iSCSI port for good. The result is anything from very slow VMDK access to VMs that crash because they can't access their disks! It looks to me as if one or more VMs generate SCSI traffic that the ESX doesn't "like", but I've not been able to track down the specific cause. At any rate, SCSI traffic should not be able to crash an ESX iSCSI port... or am I badly informed?

Has anyone seen anything like this, and what could be causing it? Please help.

The long version, including setup description:

Setup:

  • 8x ESX 3.5 U4 on HP DL380 (8x3.1GHz, 32GB RAM and local disk boot)

  • 2 Datacore SANMelody 2.04sp1 on HP DL380 (4x2.5GHz, 8GB RAM, local disk boot), each with 2xMSA60 on a P800 controller. 3xMSA60 use 12x300GB SAS 15krpm and the fourth 12x500GB SATA 7.2krpm.

  • VCMS is 2.5 and runs in a VM (Windows Server 2003, 1.512GB RAM, 1 vCPU)

  • Each ESX has a QLA4052c (2-port) iSCSI HBA, with a port logged on to each SAN

  • Each ESX has 10 NIC ports, bundled so that there are 2 NICs per vSwitch (5 in total)

  • The SAN boxes have 19 and 20 LUNs of varying size, from 250 to 500GB; all are formatted with VMFS3.21 from VCenter with a VI Client.

What seems to happen is that one or more VMs, at irregular intervals, generate traffic on the SAN (to one or more of their VMDKs), and suddenly the ESX disconnects all the LUNs on the SAN where the problematic VM has its VMDK. The ESX immediately reconnects to the SAN, but as the VM is still generating the same traffic, it is disconnected again. This continues, over and over, as fast as the SAN port can be disconnected and reconnected. The VMs on the ESX grind to a halt, as they cannot access their disks; VMs on the same host that have their VMDKs on the other SAN are not affected! The problem can last anything from seconds to hours. In some cases I can VMotion one or all of the VMs to one of the other ESX hosts, and once there, they resume normal operation. In other cases VMotion is not possible (it gets to 61% and then fails), and all the VMs have to be killed from the Service Console and the ESX rebooted.
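To line these bursts up against what the VMs were doing at the time, the individual disconnects could be grouped into episodes. What follows is only a minimal sketch, assuming the disconnect times have already been copied out of the ESX logs by hand; the function names, the 30-second gap and the sample timestamps are all hypothetical and are not real log data.

```python
from datetime import datetime, timedelta


def group_into_episodes(timestamps, max_gap=timedelta(seconds=30)):
    """Group event times into episodes.

    Two consecutive events belong to the same episode when they are
    no more than max_gap apart. The 30 s default is an arbitrary guess.
    """
    episodes = []
    for ts in sorted(timestamps):
        if episodes and ts - episodes[-1][-1] <= max_gap:
            episodes[-1].append(ts)   # still inside the current burst
        else:
            episodes.append([ts])     # start a new burst
    return episodes


def summarize(episodes):
    """Return (first event, last event, event count) for each episode."""
    return [(ep[0], ep[-1], len(ep)) for ep in episodes]


# Hypothetical disconnect times, typed in by hand for illustration only.
sample_disconnects = [
    datetime(2009, 6, 1, 10, 0, 0),
    datetime(2009, 6, 1, 10, 0, 5),
    datetime(2009, 6, 1, 10, 0, 12),
    datetime(2009, 6, 1, 11, 30, 0),
    datetime(2009, 6, 1, 11, 30, 20),
]

for start, end, count in summarize(group_into_episodes(sample_disconnects)):
    print(f"{start} -> {end}: {count} disconnects")
```

Each episode's start and end time can then be compared with backup windows, scheduled jobs or other known activity inside the VMs on the affected SAN.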

Any and all help is appreciated. Thanks in advance!

4 Replies
bister
Expert

Hi,

Can you provide some more detailed error messages? Does your ESX report any locking errors against the iSCSI target?

I have a similar problem, which causes VMs to become unreachable or to crash.

Regards,

Christian

TransennaTCF
Contributor

I am taking a short holiday and will therefore be out of the office until Monday, 13 July 2009.

NB: Important messages should be directed to telephone 8819 9990 or hotline@transenna.dk

Kind regards,

Thomas C. Fossing

IT Director, Partner

TransennaTCF
Contributor

Hi Bister,

Sorry for not getting back earlier, but this problem only got worse...

I never found the cause of the problem and I don't think I ever will, even though I had all sorts of experts on the case (Datacore, VMware and top consultants). I ditched the Datacore SANMelody and moved to an HP LeftHand Virtualization SAN, where the old Datacore units are being reused as VSAs. This is quite possibly the best structural decision I've made in the last couple of years: to date (1½ months) I haven't seen a single error, which is remarkable given the track record of the last 9 months!

Along the way, I upgraded one of the SANmelody units from a Windows Server 2003 32-bit to a Windows Server 2008 64-bit platform, and at first this did seem to be much more robust and indeed faster (despite a known critical WMI Windows interaction error that could crash the Windows box, with no fix in sight). However, as the load on this box increased, so did the problems with this SAN. I haven't done much about it, as the move to HP LeftHand is definitive and all SANmelody units will be phased out.

TimPhillips
Enthusiast

What can I say... DataCore is complicated software to use. I used it for a few months when I was developing SAN storage for our company. I've spent a lot of time testing iSCSI targets, and I can say that your choice is a good one. I'd also recommend trying Starwind. After all my tests I can say that DataCore, Starwind and LeftHand are the real top 3 among iSCSI targets. All of them have pros and cons, so it's best to test each of these tools yourself. Oh, and maybe you can try Open-E.
