Hi,
I've had a highly unpleasant problem with iSCSI twice recently. In the first case, one LUN just kind of "jammed": none of my ESXs could access it. The problem went away when I restarted one ESX server, so it looked like that one host had somehow pinned the LUN into an unusable state for everyone.
In the second case, _all_ my iSCSI traffic jammed. Running iscsi stop/start on the NetApp solved the problem.
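For the record, this is the filer-side restart I mean. It's a Data ONTAP 7-mode console fragment, not something to script blindly: stopping the target drops all active sessions, so every initiator briefly loses its LUNs.

```shell
# On the NetApp filer console (Data ONTAP 7-mode):
# restart the iSCSI target service.
iscsi stop
iscsi start

# Sanity checks afterwards: service state and active initiator sessions.
iscsi status
iscsi session show
```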
The ESXs displayed unsurprising errors to the effect of:
mptscsi: ioc0: attempting task abort! (sc=f7008340)
scsi0 : destination target 1, lun 0
command = Write (10) 00 00 02 f2 df 00 04 00 00
mptbase: ioc0: IOCStatus(0x0048): SCSI Task Terminated
mptscsi: ioc0: task abort: SUCCESS (sc=f7008340)
In both cases this happened under heavy load, but these systems are under heavy load most of the time, so that may or may not be related.
My setup is a NetApp filer serving a pile of LUNs to 6 blades running ESX with the software iSCSI initiator. The network is not congested (utilization around 10%).
Has anyone encountered this problem?
Instead of using iSCSI from the ESX, try using it from within the guest and see if that helps.
If you mean using the Linux software initiator instead of ESX's, that's not really an option.
If ESX's iSCSI won't work (it's jamming almost once a day now), we're going to have to move to a proper SAN, which would be most disappointing ($$$).
The error message you posted pertains to your RAID controller (LSI Logic Fusion), not iSCSI.
Are you sure you're not having difficulty with your local RAID controller?
Paul
We had a similar problem. I don't recognise the error code, but the symptoms you describe are exactly what we had.
The issue was resolved by ensuring that the storage adapter config --> Dynamic Discovery --> Send Targets option had the same IP and port number on each ESX host.
Previously we had multiple IP addresses configured on our SAN and ESX hosts; when we took this down to a single IP configured on the SAN and in the storage adapter config, the problem went away.
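If it helps, the Send Targets list can also be inspected and changed from the service console on ESX 3.x with vmkiscsi-tool. A configuration fragment, not a script; the adapter name vmhba40 matches the logs above but may differ on your hosts, and 10.0.0.5 is a placeholder for your filer's iSCSI address (default port 3260):

```shell
# List the current Send Targets discovery addresses for the software initiator.
vmkiscsi-tool -D -l vmhba40

# Add a single discovery address (placeholder IP).
vmkiscsi-tool -D -a 10.0.0.5 vmhba40

# Rescan the adapter so the change takes effect.
esxcfg-rescan vmhba40
```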
That sounded promising, because this problem got more frequent when I added more targets to our ESXs. So I changed things so that every ESX has only one target, the same on all of them, but sadly it didn't help: iSCSI still jams under heavy load.
Some more logs, in case by some miracle someone is able to figure something out:
vmkwarning:
May 14 09:33:25 servername vmkernel: 7:00:03:08.984 cpu0:1024)WARNING: Helper: 1289: cancel request handle=1 fn=0x6134d4
May 14 09:33:55 servername vmkernel: 7:00:03:38.986 cpu2:1275)WARNING: SCSI: 5422: READ of handleID 0x4994
May 14 09:33:55 servername vmkernel: 7:00:03:38.986 cpu2:1033)WARNING: SCSI: 5615: status No such target on adapter, rstatus 0xc0de06 for vmhba40:1:0. residual R 995, CR 80, ER 3
May 14 09:34:04 servername vmkernel: 7:00:03:47.549 cpu1:1286)WARNING: SCSI: 5422: READ of handleID 0x4994
May 14 09:34:35 servername vmkernel: 7:00:04:18.987 cpu2:1266)WARNING: SCSI: 5422: READ of handleID 0x4994
May 14 09:34:44 servername vmkernel: 7:00:04:27.549 cpu1:1275)WARNING: SCSI: 5422: READ of handleID 0x4994
May 14 09:35:15 servername vmkernel: 7:00:04:58.987 cpu2:1276)WARNING: SCSI: 5422: READ of handleID 0x4994
May 14 09:35:24 servername vmkernel: 7:00:05:07.550 cpu1:1265)WARNING: SCSI: 5422: READ of handleID 0x4994
May 14 09:35:55 servername vmkernel: 7:00:05:38.987 cpu2:1266)WARNING: SCSI: 5422: READ of handleID 0x4994
May 14 09:36:04 servername vmkernel: 7:00:05:47.550 cpu1:1266)WARNING: SCSI: 5422: READ of handleID 0x4994
May 14 09:36:04 servername vmkernel: 7:00:05:47.551 cpu1:1036)WARNING: SCSI: 7916: status No such target on adapter, rstatus 0xc0de06 for vmhba40:1:0. residual R 995, CR 80, ER 3
May 14 09:36:04 servername vmkernel: 7:00:05:47.551 cpu1:1036)WARNING: FS3: 4008: Reservation error: No such target on adapter
May 14 09:36:05 servername vmkernel: 7:00:05:48.983 cpu0:1024)WARNING: Helper: 1289: cancel request handle=1 fn=0x6134d4
vmkernel:
May 14 09:37:16 servername vmkernel: 7:00:06:59.418 cpu2:1285)WARNING: SCSI: 5422: READ of handleID 0x4994
May 14 09:37:16 servername vmkernel: 7:00:06:59.418 cpu2:1033)SCSI: 8021: vmhba40:1:0:1 status = 0/3 0x0 0x0 0x0
May 14 09:37:16 servername vmkernel: 7:00:06:59.418 cpu2:1033)SCSI: 8109: vmhba40:1:0:1 Retry (abort after timeout)
May 14 09:37:16 servername vmkernel: 7:00:06:59.418 cpu2:1033)SCSI: 3169: vmhba40:1:0:1 Abort cmd due to timeout, s/n=27768, attempt 1
May 14 09:37:16 servername vmkernel: 7:00:06:59.418 cpu2:1033)LinSCSI: 3596: Aborting cmds with world 1024, originHandle 0x72010d0, originSN 27768 from vmhba40:1:0
May 14 09:37:16 servername vmkernel: 7:00:06:59.418 cpu2:1033)SCSI: 3182: vmhba40:1:0:1 Abort cmd on timeout succeeded, s/n=27768, attempt 1
May 14 09:37:16 servername vmkernel: 7:00:06:59.418 cpu2:1033)SCSI: 8021: vmhba40:1:0:1 status = 0/5 0x0 0x0 0x0
May 14 09:37:16 servername vmkernel: 7:00:06:59.418 cpu2:1033)SCSI: 8116: vmhba40:1:0:1 Retry (timedout and aborted)
May 14 09:37:24 servername vmkernel: 7:00:07:07.599 cpu1:1265)WARNING: SCSI: 5422: READ of handleID 0x4994
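For what it's worth, a quick way to see how often these aborts recur, and against which path, is to count them straight out of the log. A grep/awk sketch: the heredoc below is a stand-in sample for what on a real host would be /var/log/vmkernel.

```shell
# Sample log excerpt standing in for /var/log/vmkernel on an ESX host.
cat > vmkernel.sample <<'EOF'
May 14 09:37:16 servername vmkernel: 7:00:06:59.418 cpu2:1033)SCSI: 3169: vmhba40:1:0:1 Abort cmd due to timeout, s/n=27768, attempt 1
May 14 09:38:02 servername vmkernel: 7:00:07:45.120 cpu1:1033)SCSI: 3169: vmhba40:1:0:1 Abort cmd due to timeout, s/n=27791, attempt 1
EOF

# Count abort-on-timeout events per vmhba path (field 9 is the path).
grep 'Abort cmd due to timeout' vmkernel.sample \
  | awk '{print $9}' \
  | sort | uniq -c
# → 2 vmhba40:1:0:1
```

On a live host you'd point the grep at /var/log/vmkernel instead; a count that climbs on one path only points at a single LUN or target, while counts on all paths suggest the whole target stack is stalling.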
hmm
Besides what I mentioned above, we did try some other things as well.
Do you have a dedicated switch for iSCSI, or have you VLANned it off?
We decided in the end to create a separate VLAN purely for iSCSI to ensure nothing else could interfere with it.
Other things we did were to increase the service console memory to the maximum of 800 MB and to ensure our VC Server (which is also a VM) had enough memory (2 GB).
I wish I had a definitive answer for you, as that would mean I could be 100% sure my own infrastructure was sound. If I'm honest, I still get occasional problems of the nature you describe, though they are nowhere near as frequent as they once were.
Do you have a dedicated switch for iSCSI, or have you VLANned it off?
Not a dedicated switch; it's VLANned. The switches are under no significant load and the network is gigabit throughout.
We decided in the end to create a separate VLAN purely for iSCSI to ensure nothing else could interfere with it.
Yes, we're also using a totally separate VLAN reserved purely for iSCSI traffic. Every interface in this VLAN is physically separate and untagged.
Other things we did were to increase the service console memory to the maximum of 800 MB and to ensure our VC Server (which is also a VM) had enough memory (2 GB).
I'll try those soon. For now, though, we've had to bite the bullet and move over to a regular fibre SAN. So far (knock on wood) it's working without problems.