I have discussed this event with others internally, and I have not been informed of a method of filtering or throttling these events. The request for this feature has been submitted. The feature request submission, review, approval, and development process is not public. We cannot make any public-facing statements or share any details as to whether the feature will be included in a future version. If you feel strongly about this feature request, please reach out to your account management team to provide use cases and help prioritize it. Thank you.
Hi Daniel - thanks so much for the response - appreciate VMware having a look at the thread. A filter feature would be a great addition if time/resources permit for you guys! Thanks again. Doug
I'm seeing these alerts as well, but there don't seem to be any alarms in vCenter to change for these, correct? I wonder why that is? Shouldn't everything be created as a vCenter alarm? So how would I change the alerts to get email notifications if I wanted those? I'm not seeing any vCenter alarms for devices.
I do not believe these messages are false. These messages occur because there IS high latency occurring, although very briefly. I have confirmed that there is a problem with ESXi 5 and the software iSCSI initiator. I purchased VMware from Dell with Dell R710 servers and must get my support through Dell.
Look at this:
What this shows is esxtop for 2 ESXi hosts accessing the same datastore. For some reason, the host that is 'inactive' seems to pause or lag, and the latency can spike anywhere from 50 to 2000 milliseconds. In this example, it's 561 ms on the inactive host and 1.8 ms on the active host. When I run IOMeter on VMs that run on the datastore, the average performance is normal, but IOMeter does show the max I/O response time jumping to the high latency numbers reported in the vCenter event log. These are also shown in the performance graphs for the hosts.
The reason most people say to ignore this, I believe, is that having high latency on the inactive host and not the active one means applications and VMs will generally perform as expected. However, with a high workload, it can actually cause the inactive host to lose the connection altogether. Also, if both hosts try to access the same datastore at the same time, the actual VMs or applications CAN lag significantly because of this.
There is clearly a problem with software iSCSI. I have completely different datastores, different physical and virtual hardware, and completely separate drivers for different hardware. The only thing in common is software iSCSI and ESXi 5 (it does not happen on 4.1).
This seems to be a weird locking issue or something like it between different hosts accessing the same iSCSI datastore. I can reproduce this with any iSCSI initiator and with completely different iSCSI datastores. This is a VMware bug that needs to be addressed, not ignored. I wish I could deal with VMware directly, but I have to go through Dell. For those who reported a similar issue without software iSCSI, it looks like it really is your hardware/setup, and that just hides this particular problem.
VMware... this is reproducible. Something is wrong with software iSCSI in ESXi 5.
This is basically what I noticed as well. The times I see these messages in the event log are usually during hours when there is almost no traffic on the SAN and/or VM hosts, which is why I think most people are saying to either ignore the warnings or that they are not true. There clearly is a problem of some sort, though. I have been working with Dell to alleviate the high latency times reported on the SAN group itself (we have EqualLogics), and their suggestions have definitely helped in that regard, but these warnings in vSphere still remain. I think I might submit a ticket with VMware, just because I pay for support and I want to make sure they are seeing reports of this instead of hoping they read these forums.
Yeah, I have been trying everything I can think of to resolve this issue myself, but haven't been able to.
Again, I didn't notice this until I migrated to vSphere 5, and with the number of people reporting this issue, I definitely think VMware needs to take a look at this.
I've noticed the same issue on my FAS2040, but my events happen only when running SMVI (NetApp VSC) backups and when snapshots are being removed (they usually happen together).
Check to see if you can correlate the events with high NetApp CPU usage. If so, check whether you have large numbers of snapshots, possibly in combination with dedupe. Also check your NetApp and SMVI snapshot schedules.
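If you want to do that correlation systematically instead of eyeballing logs, a few lines of Python are enough. This is just a sketch - the timestamps below are made up for illustration, and you'd feed in your own event and job times pulled from vCenter and your SMVI schedule:

```python
# Sketch: flag latency events that fall within a window of a scheduled
# snapshot/backup job. All timestamps here are illustrative, not real data.
from datetime import datetime, timedelta

def events_near_jobs(event_times, job_times, window_min=10):
    """Return the latency events that occur within window_min minutes of any job."""
    window = timedelta(minutes=window_min)
    return [e for e in event_times
            if any(abs(e - j) <= window for j in job_times)]

events = [datetime(2012, 5, 1, 2, 3), datetime(2012, 5, 1, 14, 30)]
jobs   = [datetime(2012, 5, 1, 2, 0)]   # e.g. a nightly SMVI backup kickoff
print(events_near_jobs(events, jobs))   # only the 02:03 event matches
```

If most of your events land inside the window, the snapshot/dedupe schedule is a good suspect; if they're scattered, it probably isn't.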
This is interesting.
I was on a call for a vCenter Server issue and the VMware tech was looking at the logs on one of my hosts when he noticed the I/O latency increased errors in the logs. The funny thing is they didn't start appearing until after I updated to 5 update 1. I checked my other host and it had the same errors on LUNs that are presented to multiple ESXi hosts. Prior to the upgrade I was receiving iscsi_vmk: iscsivmk_ConnReceiveAtomic: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0] errors. This has something to do with EqualLogic LUNs that aren't configured for access from multiple ESXi hosts... so that is okay. I don't see them happening anymore after the 5 update 1 upgrade... odd.
There was a networking bug in 5 that could affect iSCSI connections, but I don't know if it is the same one being discussed in this thread. It has been fixed in 5 update 1. See the links below.
Can anyone who had the latency errors prior to upgrading confirm whether this issue persists in 5 update 1?
I ended up putting a support call into VMware about this message. For me, I wasn't actually noticing any problems with my setup; I was just worried about the messages. After the rep verified all of my settings were optimal, they pulled a developer onto the call who basically said that these messages don't necessarily indicate a problem in my case. They appear when the latency changes by 20 or 30% (can't remember which). I'm seeing messages like 'latency changed from 1286 microseconds to 24602 microseconds', which is still only 24 ms. This happens for a second and then it drops back down again, so it isn't even that high to begin with and only lasts a second. They also confirmed that these messages are new to version 5, so for people who only started seeing them after upgrading from 4 to 5, that's why. Anyway, I wish they would change the logic on these messages so they would only appear if the percentage changed by a certain amount AND the overall latency was over a certain threshold as well.
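To make the difference concrete, here's a rough sketch of the warning logic as the developer described it, next to the change I'm wishing for. The 30% figure and the 30 ms floor are my guesses, not VMware's actual values:

```python
# Sketch of the latency-warning logic as described on the support call.
# The 30% trigger and the 30 ms floor are assumed values for illustration.

def current_warning(prev_us, curr_us, pct=0.30):
    """Fires whenever latency changes by more than pct, regardless of magnitude."""
    return abs(curr_us - prev_us) > pct * prev_us

def proposed_warning(prev_us, curr_us, pct=0.30, floor_us=30000):
    """Only fires when the change is large AND the new latency is actually high."""
    return current_warning(prev_us, curr_us, pct) and curr_us >= floor_us

# The 1286 -> 24602 microsecond jump from my log triggers today's warning,
# but would be suppressed under a 30 ms absolute floor:
print(current_warning(1286, 24602))   # True
print(proposed_warning(1286, 24602))  # False
```

That second check is all it would take to stop flooding the event log with sub-millisecond-to-24-ms blips while still catching real sustained latency.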
After what seems like 1,000 tests and an eternity, Dell has finally decided to involve the VMware support team. It really does seem like there is a problem with software iSCSI in ESXi 5. The problem has not been resolved in any updates. As my image above shows, this is a quick burst of high latency, and it always occurs on the hosts that are not on the active path, but the latency can leak over to the active path. I think there is either a bug in monitoring latency (i.e., the monitoring functionality within VMware itself is the cause of the latency) or, I believe, there is a flaw in software iSCSI with something like locking or pathing that causes the issue.
I would ignore this, except my tests show that even a small burst of latency can cause the ESXi host to disconnect from the datastore, granted it takes a serious workload to get that result.
I will post back with the results of the VMware support testing (hopefully I have a repeatable environment where they can isolate the problem). I am pulling my hair out - over $100K in equipment has been on hold for months over this.
Thanks for the info. Please keep us updated!
Another FYI: I have now proven that if all the software iSCSI sources have at least one active path, the latency disappears and everything performs well. I get the WORST latency (by far) when the paths are inactive and the hosts are doing essentially nothing. Look at the image below. The latency plunged the moment I started activity on all paths. I can repeat this over and over. The latency to the left occurred with nothing going on with the hosts; they were idle... then I started IOMeter with multiple threads on clients of each software iSCSI source (you would think it would get worse) and the latency went back to normal.
How are you determining if/when paths are active? Do you have multiple hosts accessing the volume? The graph you show is for a particular host, no? Does it show the same on all hosts?
I have 4 hosts and 2 datastores that are on all 4 hosts. One datastore has VMs on it that are moderately busy, and the other was just created a few weeks ago and only has a couple of VMs on it. Over the last hour (that's all the real-time graph shows) the latency numbers on the almost unused datastore look like:
host 1: max 5 ms, average 0.1 ms
host 2: max 5 ms, average 0.033 ms
host 3: max 0 ms, average 0 ms
host 4: max 3 ms, average 0.022 ms
These are obviously low-usage numbers for this volume, and I am not seeing the numbers that you are seeing. Also, your graph doesn't show it, but is that read or write latency that is spiking?
Yes, there are multiple (two) hosts, and yes, I see the same result on both hosts. I use esxtop to monitor the latency as well as the built-in performance charts. I know which path will be active based on which datastore contains the VM and by choosing the host that will run the VM. I only have this issue on ESXi 5, not 4.1. I have completely distinct software iSCSI sources (i.e., a Dell MD3200i and a FalconStor NSSVA). Both of these sources have multiple datastores. Both use different physical connections (i.e., different network cards, different switches, etc.). There are no common drivers or physical connections, yet both have the same problem.
Again, if I have all my iSCSI data sources active, the latency goes back to normal. If any path is inactive (no VMs performing any read/write activity), I get bad latency.
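For anyone who wants to reproduce the workaround without running IOMeter everywhere, even a trivial periodic write is enough to keep a path from sitting idle. This is just a sketch - the mount paths, payload size, and interval are arbitrary choices on my part, not anything VMware recommends:

```python
# Sketch of a keep-alive writer: periodically write and fsync a small file on
# each datastore mount so no iSCSI path sits completely idle. Payload size
# and interval are arbitrary illustrative values.
import os
import time

def keep_paths_active(datastore_mounts, payload=b"x" * 4096,
                      interval=5.0, iterations=None):
    """Write a small synced file to every mount each interval.
    iterations=None runs forever; pass a number for a bounded run."""
    n = 0
    while iterations is None or n < iterations:
        for mount in datastore_mounts:
            path = os.path.join(mount, ".keepalive")
            with open(path, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # force the write out to storage
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval)
```

Obviously this is a band-aid, not a fix - but if background writes on every path make your latency spikes vanish the way mine did, that's more evidence the problem is idle-path behavior in software iSCSI rather than the storage itself.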
So if you were to create a new volume and put nothing on it, would you see this behavior? I'm trying to replicate it here but so far have been unable to. I am only running 5.0 update 1 here; I never had 4.x installed.