VMware Cloud Community
CB_VM
Contributor

Mystery IOPS - ESXi 5.5 Update 1

We're running ESXi 5.5 build 1881737 and vCenter Server 5.5 build 1623101. We are seeing a lot of unexplained read and write IOPS from our ESXi hosts. The cluster has 16 hosts, each with 6 LUNs, and each LUN is doing 5-8 read IOPS, so we are getting over 1000 unexpected IOPS on the SAN. Only four of the ESXi hosts have guests, and the IOPS are not coming from the guests; every host is generating IOPS, even those with no guests at all. In the attached image you can see roughly 40 read and 10 write IOPS on a host with no guests, so about 50-60 IOPS x 16 hosts is close to 1000 IOPS hitting the SAN that aren't from guests.

I have rebooted the whole cluster and re-scanned all datastores. We already have a case open with VMware; they requested the cluster reboot (great), which didn't work. They suspected an APD or PDL path issue, but I checked all hosts manually and there are no dead paths.

Hardware: Dell M1000e chassis with Dell M620 blades and a Dell EqualLogic PS6210XS iSCSI array. ESXi is installed on SD cards, and we are using the software iSCSI adapter. All LUNs are seeing IOPS, even those with no guests created on them yet.

Any thoughts on how to fix this issue?

16 Replies
JPM300
Commander

Hey CB_VM

One thing that could cause this is the Path Selection Policy (PSP) or the Storage Array Type Plugin (SATP) being incorrect for that particular model of array or the firmware running on it. This could cause pathing issues, which could lead to re-requests on ordinary requests. I think it would be odd to see as much I/O as you are seeing with no guests running on the cluster, but a wrong SATP or PSP can definitely affect performance.

It looks like the Dell EqualLogic you have has two 10 GbE NICs for iSCSI? Did you set up both in the same vSwitch on the same network? If so, did you bind those VMkernel ports to iSCSI? Dell also used to have their own VIB or installation script that would do most of the iSCSI setup; I'm not sure if that is still the case in ESXi 5.5, but it could be something else to look into.

Information on SATP

VMware KB: Changing the default pathing policy for new/existing LUNs

Information on PSP

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101134...
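For what it's worth, you can verify which SATP and PSP each LUN is actually using straight from the ESXi shell. This is just a sketch of the relevant commands; the `naa.xxxxxxxx` device ID in the commented line is a placeholder, not from your environment:

```shell
# On an ESXi host (SSH or local shell).

# List the installed SATPs and the default PSP each one maps to
esxcli storage nmp satp list

# Show which SATP and PSP every storage device is actually using
esxcli storage nmp device list

# Change the PSP for a single device (placeholder device ID shown)
# esxcli storage nmp device set --device naa.xxxxxxxx --psp VMW_PSP_RR
```

If the EqualLogic LUNs don't show the policy Dell recommends, that's where to start.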

CB_VM
Contributor

iSCSI has its own distributed switch. It was all created using the Dell script. We are using Dell_PSP_EQL_Routed as the path selection policy, which is provided by Dell. Yes, we do have 2 x 10 GbE NICs, and they are on the same network. (Screenshot attached: 1.JPG)

JPM300
Commander

Yeah that all looks good.


Go into your iSCSI adapter settings (vmhba33, vmhba38, or whatever it's named) and open the Network Configuration tab in there. Just double-check that the iSCSI NICs are bound. Dell has also flip-flopped on this setting throughout the years, as they have their own path selection plugin (Dell_PSP_EQL) to leverage.

Aside from that, I would open a case with VMware/Dell and definitely ask why so much data is going back and forth.
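Port binding can also be confirmed from the command line; `vmhba33` below is a placeholder for whatever your software iSCSI adapter is named:

```shell
# On the ESXi host: confirm the software iSCSI adapter and its bound VMkernel ports.

# Identify the software iSCSI adapter (look for the iscsi_vmk driver)
esxcli iscsi adapter list

# List the VMkernel ports bound to it -- both iSCSI vmk ports should appear
esxcli iscsi networkportal list --adapter=vmhba33
```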

CB_VM
Contributor

We already have tickets open with Dell and VMware. VMware requested the cluster reboot, which I just did, and no luck. Calling them back again :(

(Screenshot attached: 2.JPG)

JPM300
Commander

Yeah, it looks like you have everything set up correctly.

This is an odd one. Could it be an issue with the cluster/group name and how it splits the traffic or something...? Let us know; I'm curious what is causing the extra traffic and what the resolution will be.

schepp
Leadership

Hi,

How is your iSCSI network configured? Are all hosts and EqualLogic arrays using a /16 subnet mask like in your screenshot?

Have you enabled and tested an MTU of 9000?
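A quick way to verify jumbo frames end-to-end is `vmkping` with a don't-fragment flag and a jumbo-sized payload; the portal address below is a placeholder for your array's iSCSI IP:

```shell
# From one ESXi host, ping the array (or another host's iSCSI vmk port) with a
# jumbo-sized, unfragmentable packet. 8972 = 9000-byte MTU minus 28 bytes of
# IP + ICMP headers. 10.10.0.50 is a placeholder address.
vmkping -d -s 8972 10.10.0.50
```

If this fails while a normal `vmkping` succeeds, some device in the path is not passing jumbo frames.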

Tim

CB_VM
Contributor

Yes, all /16, and yes, we are at 9000 MTU. It's not a bandwidth issue.

(Screenshot attached: 3.JPG)

frankdenneman
Expert

Have you enabled SIOC on the datastores? If so, SIOC is accessing the .iormstats.sf file on every datastore. Each host in the cluster does the same, to read the averages posted by the other hosts and to post its own. This usually leads to a steady-state load of about 0.5 IOPS per host per LUN. Seeing each LUN doing 5-8 IOPS looks like SIOC hosts transferring state to one another.

Can you confirm you enabled SIOC?

Blogging: frankdenneman.nl Twitter: @frankdenneman Co-author: vSphere 4.1 HA and DRS technical Deepdive, vSphere 5x Clustering Deepdive series
CB_VM
Contributor

Storage I/O Control is disabled

(Screenshot attached: 4.JPG)

frankdenneman
Expert

Looks like it's disabled; back to troubleshooting.

BTW, I would always check state from the command line. It wouldn't be the first time there's a discrepancy in 5.5 between the CLI and the GUI.

CB_VM
Contributor

What is the command to check that?

frankdenneman
Expert

As always, the first stop is LucD :) http://www.lucd.info/2010/10/20/automate-sioc/

CB_VM
Contributor

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=204070...

Found the root cause: it was the SNMP service generating all those IOPS. A lot of IOPS. As soon as I stopped SNMP, all the extra IOPS disappeared. I think it is because ESXi is installed on SD cards, and they were running out of space and not keeping up with all the traps, which also caused /tmp/ramdisk to fill up. Let's see what VMware support has to say about that.
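For anyone who hits this later: the SNMP agent state can be checked and toggled per host from the ESXi shell (assuming ESXi 5.1 or later, where the `esxcli system snmp` namespace exists):

```shell
# On each ESXi host: inspect the SNMP agent configuration
esxcli system snmp get

# Disable the SNMP agent to test (reversible)
esxcli system snmp set --enable false

# Re-enable once the trap/collection settings are sorted out
# esxcli system snmp set --enable true
```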

JPM300
Commander

Hmmm, odd. Thanks for the update and resolution. I wouldn't have thought it was SNMP causing it.

CB_VM
Contributor

Yup, strange one. I am down from 2000 IOPS to 200 after stopping the SNMP service.

(Screenshot attached: 12.JPG)

CB_VM
Contributor

I turned SNMP back on but excluded disk data collection from our alerting and performance monitoring server, and everything is working.
