We have been testing out iSCSI on a ESX4 host and have noticed a problem.
At this point we do NOT have any host running off of iSCSI. There are two host that are on the iSCSI datastore but they are powered OFF.
If we power off our iSCSI target (an openfiler server) all of the VMs on the server appear to hang for at least few seconds up to a minute at a time at least twice an hour. We aren't sure if they are actually hanging or if the networking just stops working. During that time period we can't run esxtop because the service console also stops responding. We have not been logged into the console directly when it happens to prove if the entire host is hanging or just the network stack. As soon as we turn our iSCSI server back on everything goes back to normal. The problem started when we started testing iSCSI and we can't reproduce the issue when our iSCSI server is live.
During the outage time we also get gaps in our performance data. On our CPU percent graphs we see one core spike to 100% right before the gap but on the VM usage graph we don't see the spike (see attached).
Anyone have any ideas what could be causing this or how to fix the problem? We plan on turning off iSCSI but that requires a reboot of the EXS server.
There are 2 VMs that are on the volumes that is being turned off but they are shutdown/powered off when we see the problem.
This is the only iSCSI target that is exported to the ESX box.
I have attached my vmkernel log.
iSCSI went offline at/around 14:40 which matchs up with the log file (and is the first entry in my vmkernel for today)
The problem stoped around 15:05 which is when the log stops as well.
My issue and log file sounds very similar to this: http://communities.vmware.com/thread/212195
So, it appears you are trying to take the storage offline while some VMs using them ?. If you need to make any changes to storage, you should stop all the VMs and then make changes to storage, rescan on ESX and then re-start the VMs. Or use storage vmotion to move the VMs to another storage and make changes to storage and can svmotion back.
Following logs indicates that there are applications using the device:
Jun 4 15:02:07 VMH2 vmkernel: 8:15:48:15.645 cpu11:4119)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "t10.F405E46494C45400544746032793D286748584D203442736" due to Not found
Jun 4 15:02:07 VMH2 vmkernel: 8:15:48:15.645 cpu11:4119)WARNING: NMP: nmp_DeviceRetryCommand: Device "t10.F405E46494C45400544746032793D286748584D203442736": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Jun 4 15:02:07 VMH2 vmkernel: 8:15:48:15.645 cpu11:4119)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "t10.F405E46494C45400544746032793D286748584D203442736" is blocked. Not starting I/O from device.
Jun 4 15:02:08 VMH2 vmkernel: 8:15:48:16.645 cpu15:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "t10.F405E46494C45400544746032793D286748584D203442736" - issuing command 0x4100080f1a80
Jun 4 15:02:08 VMH2 vmkernel: 8:15:48:16.645 cpu15:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "t10.F405E46494C45400544746032793D286748584D203442736" - failed to issue command due to Not found (APD), try again...
Jun 4 15:02:08 VMH2 vmkernel: 8:15:48:16.645 cpu15:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "t10.F405E46494C45400544746032793D286748584D203442736": awaiting fast path state update...
Jun 4 15:02:09 VMH2 vmkernel: 8:15:48:17.645 cpu15:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "t10.F405E46494C45400544746032793D286748584D203442736" - issuing command 0x4100080f1a80
Jun 4 15:02:09 VMH2 vmkernel: 8:15:48:17.645 cpu15:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "t10.F405E46494C45400544746032793D286748584D203442736" - failed to issue command due to Not found (APD), try again...
Jun 4 15:02:09 VMH2 vmkernel: 8:15:48:17.645 cpu15:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "t10.F405E46494C45400544746032793D286748584D203442736": awaiting fast path state update...
Is there some VMFS operations, like clone or backup going on ?. Just before taking the storage offline, can you do 'vsi_traverse' and upload the ouput file it creates ?.
No clone, migrate, or backup operations are running when we experience the problem.
The problem doesn't start as soon as iSCSI disconnected, it happens about once every 30 minutes so we can be running for up to 30 minutes with iSCSI disconnected before we notice the problem.
Is there maybe something ESX is running in the background at a regular interval (20 or 30 minutes) that would be trying to access iSCSI? If so why would that lock up the entire server?
I'm having issues uploading the vsi_traverse output to the forums, I'll post it as soon as I figure out what I am doing wrong.
30 minutes is the timeout vSphere server uses to periodically query the capacity of the datastores, how much is used etc. So even if no VMs are using it, hostd will try to open the datastores every so often to get the datastore information ...
So that part is explainable. Now why it causes the COS to hang is odd. How did you figure out if the machine hung or not?
One thing I forgot to ask: Have you done rescan on ESX after removing/disabling the storage ?. It is important to perform a rescan after storage is removed/disabled.
BTW, could you please mention which storage are you using and how you are disabling ?.
We are using openfiler as our iSCSI target. We are just in the testing phases of using iSCSI and if all goes well we might switch to a netapp,
We did not remove the storage at all from ESX.. We initially found the issue after an unexpected iSCSI outage. Since the problem didn't show up untill 20-30 minutes after the iSCSI went down we didn't make the connection that it was the cause. We figured it out when the iSCSI was down a second time (this time was expected but we didn't remove the target from ESX, we just shutdown the two VMs that were using iSCSI)
So basically the problem we are seeing is that if the iSCSI goes down for any reason ALL of the VMs (no matter if they were using the iSCSI target or not) start to have problems.
I'll try and get the SR filed ASAP. It is in another users name and he is having issues getting me added.
I've been doing some iSCSI work also with SanMelody and ESX 3.5 and 4.
Being a SCSI "bus" (and I'm theorizing here) it seems that any interruption to the normal operation, as it shutting it down without detaching from the ESX servers, would cause a problem. I don't know if iSCSI has some built in compensation for that.
Whether or not an individual VM is using the iSCSI for operations would be irrelevant. It's the ESX server itself that is establishing the link (hence needed to scan and identify the target in the first place) to the SAN. In my setup I have 2 ESX servers in a cluster, sharing the same datastore on the ISCSI SAN.
I've noticed when I've booted those ESX servers without the SAN being turned on (again test environment) the boot process hangs for a few minutes when its searching to reconnect to the iSCSI target. Its actually quite a long time out period. Is it possible that VMWare just gets a wild hair when it detects that the SAN has gone offline and it attempts to search for it?
Please advise if any of my observations are wildly off. I'm just posting what I've run into myself.
We're seeing the exact same issue as you. If iSCSI storage is powered down (even after removing its ip from the dynamic discovery list and performing a rescan across all HBAs on all ESX hosts) we get substantual packet loss on a regular basis with the same symtoms you describe.
I wonder if there is anyway to reduce the timeout on the regular hostd scan that was mentioned here? Also we have found that the only way to get an ESX host to completely 'forget' the iSCSI datastore is to perform a reboot.
Do you have any updates on this from support?