VMware Cloud Community
CB_VM
Contributor

Mystery IOPS - ESXi 5.5 Update 1

We're running ESXi 5.5 build 1881737 and vCenter Server 5.5 build 1623101. We are seeing a lot of unexplained read and write IOPS from our ESXi hosts. The cluster has 16 hosts, each with 6 LUNs, and each LUN is doing 5-8 read IOPS, so we are getting over 1000 unexpected IOPS on the SAN. Only four of the ESXi hosts have guests, and the IOPS are not coming from the guests; every host is generating IOPS, even those with no guests at all. In the attached image you can see roughly 40 read and 10 write IOPS on a host with no guests, so about 50-60 IOPS x 16 hosts is close to 1000 IOPS hitting the SAN that aren't from guests.

I have rebooted the whole cluster and re-scanned all datastores. We already have a case open with VMware; they requested the cluster reboot (great), which didn't work. They suspected an APD or PDL path issue, but I checked all hosts manually and there are no dead paths.

Hardware: Dell M1000e chassis with Dell M620 blades and a Dell EqualLogic PS6210XS iSCSI array. ESXi is installed on SD cards, and we are using the software iSCSI adapter. All LUNs are seeing IOPS, even those with no guests created on them yet.

Any thoughts on how to fix this issue?

16 Replies
JPM300
Commander

Hey CB_VM

One thing that could cause this is the Path Selection Policy (PSP) or the Storage Array Type Plugin (SATP) being incorrect for that particular model of array or the firmware running on it. This could cause pathing issues, which could lead to re-requests on ordinary requests. I think it would be odd to see as much I/O as you are seeing with no guests running on the cluster, but a wrong SATP or PSP can definitely affect performance.

It looks like the Dell EqualLogic you have has two 10 GbE NICs for iSCSI? Did you set up both in the same vSwitch on the same network? If so, did you bind those VMkernel ports to iSCSI? Dell also used to have their own VIB or installation script that would do most of the iSCSI setup; I'm not sure if that is still the case in ESXi 5.5, but it could be something else to look into.

Information on SATP

VMware KB: Changing the default pathing policy for new/existing LUNs

Information on PSP

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101134...
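For what it's worth, you can verify which SATP and PSP each LUN is actually using straight from the ESXi shell. This is just a sketch of the relevant commands; the `naa.xxxxxxxx` device ID in the commented line is a placeholder, not from your environment:

```shell
# On an ESXi host (SSH or local shell).

# List the installed SATPs and the default PSP each one maps to
esxcli storage nmp satp list

# Show which SATP and PSP every storage device is actually using
esxcli storage nmp device list

# Change the PSP for a single device (placeholder device ID shown)
# esxcli storage nmp device set --device naa.xxxxxxxx --psp VMW_PSP_RR
```

If the EqualLogic LUNs don't show the policy Dell recommends, that's where to start.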

CB_VM
Contributor

iSCSI has its own distributed switch. It was all created using the Dell script. We are using Dell_PSP_EQL_Routed as the path selection policy, which is provided by Dell. Yes, we do have 2 x 10 GbE NICs, and they are on the same network. (Screenshot attached: 1.JPG)

JPM300
Commander

Yeah that all looks good.


Go into your iSCSI adapter settings (vmhba33, vmhba38, or whatever it's named) and open the Network Configuration tab in there. Just double-check that the iSCSI NICs are bound. Dell has also flip-flopped on this setting throughout the years, as they have their own path selection plugin (Dell_PSP_EQL) to leverage.

Aside from that, I would open a case with VMware/Dell and definitely ask why so much data is going back and forth.
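Port binding can also be confirmed from the command line; `vmhba33` below is a placeholder for whatever your software iSCSI adapter is named:

```shell
# On the ESXi host: confirm the software iSCSI adapter and its bound VMkernel ports.

# Identify the software iSCSI adapter (look for the iscsi_vmk driver)
esxcli iscsi adapter list

# List the VMkernel ports bound to it -- both iSCSI vmk ports should appear
esxcli iscsi networkportal list --adapter=vmhba33
```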

CB_VM
Contributor

We already have tickets open with Dell and VMware. VMware requested the cluster reboot, which I just did, and no luck. Calling them back again :(

(Screenshot attached: 2.JPG)

JPM300
Commander

Yeah, it looks like you have everything set up correctly.

This is an odd one. Could it be an issue with the cluster/group name and how it splits the traffic or something...? Let us know; I'm curious what is causing the extra traffic and what the resolution will be.

schepp
Leadership

Hi,

How is your iSCSI network configured? Are all hosts and EqualLogic arrays using a /16 subnet mask like in your screenshot?

Have you enabled and tested an MTU of 9000?
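A quick way to verify jumbo frames end-to-end is `vmkping` with a don't-fragment flag and a jumbo-sized payload; the portal address below is a placeholder for your array's iSCSI IP:

```shell
# From one ESXi host, ping the array (or another host's iSCSI vmk port) with a
# jumbo-sized, unfragmentable packet. 8972 = 9000-byte MTU minus 28 bytes of
# IP + ICMP headers. 10.10.0.50 is a placeholder address.
vmkping -d -s 8972 10.10.0.50
```

If this fails while a normal `vmkping` succeeds, some device in the path is not passing jumbo frames.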

Tim

CB_VM
Contributor

Yes, all /16, and yes, we are at 9000 MTU. It's not a bandwidth issue.

(Screenshot attached: 3.JPG)

frankdenneman
Expert

Have you enabled SIOC on the datastores? If so, SIOC is accessing the .iormstats.sf file on every datastore. Each host in the cluster does the same, to read the averages posted by the other hosts and to post its own. This usually leads to a steady-state load of about 0.5 IOPS per host per LUN. Seeing each LUN doing 5-8 IOPS looks like SIOC hosts transferring state to one another.

Can you confirm you enabled SIOC?

Blogging: frankdenneman.nl Twitter: @frankdenneman Co-author: vSphere 4.1 HA and DRS technical Deepdive, vSphere 5x Clustering Deepdive series
CB_VM
Contributor

Storage I/O Control is disabled

(Screenshot attached: 4.JPG)

frankdenneman
Expert

Looks like it's disabled; back to troubleshooting.

BTW, I would always check state from the command line. It wouldn't be the first time there's a discrepancy in 5.5 between the CLI and the GUI.

CB_VM
Contributor

What is the command to check that?

frankdenneman
Expert

As always, the first stop is LucD :) http://www.lucd.info/2010/10/20/automate-sioc/

CB_VM
Contributor

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=204070...

Found the root cause: it was the SNMP service generating all those IOPS. A lot of IOPS. As soon as I stopped SNMP, all the extra IOPS disappeared. I think it is because ESXi is installed on SD cards, and they were running out of space and not keeping up with all the traps, which also caused /tmp/ramdisk to fill up. Let's see what VMware support has to say about that.
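For anyone who hits this later: the SNMP agent state can be checked and toggled per host from the ESXi shell (assuming ESXi 5.1 or later, where the `esxcli system snmp` namespace exists):

```shell
# On each ESXi host: inspect the SNMP agent configuration
esxcli system snmp get

# Disable the SNMP agent to test (reversible)
esxcli system snmp set --enable false

# Re-enable once the trap/collection settings are sorted out
# esxcli system snmp set --enable true
```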

JPM300
Commander

Hmmm, odd. Thanks for the update and resolution. I wouldn't have thought it was SNMP causing it.

CB_VM
Contributor

Yup, strange one. I am down from 2000 IOPS to 200 after stopping the SNMP service.

(Screenshot attached: 12.JPG)

CB_VM
Contributor

I turned SNMP back on but excluded disk data collection from our alerting and performance monitoring server, and everything is working.
