VMware Cloud Community
bukais
Contributor

Problem with my ESX Servers

Hi all:

Today I accessed my servers and found the following error messages. Could someone help me? I tried to find something on Google, but I could not find anything.

This error is from a PowerEdge R710 with 1 processor and 36 GB of RAM:

20:00:00:32.355 cpu1:4257)ScsiDevice IO: 2352: Failed write command to write-quiesced partition naa.6090a068e013387870cca4cfc275c505:1

This error is from another PowerEdge R710 with 1 processor and 36 GB of RAM:

20:00:18:07.916 cpu2:4256)ScsiDevice IO: 2352: Failed write command to write-quiesced partition naa.6090a068e013387870cca4cfc275c505:1

I'd appreciate it if someone could help me.

Thanks a lot

Bukais

xeen3d
Contributor

I can't find anything about this, either. Here, the errors started popping up about a month ago. Now they show up every couple of days, often on more than half of the hosts.

I don't really see anything going wrong because of this, but it's irritating, to say the least.

Any insight would be welcome.

legisilver
Contributor

So have you found anything wrong?  I just checked my logs and the error in the email that was sent to me isn't even there.

Anyone?

cnidus
Contributor

Hi,

Did you find a solution for this issue? I'm starting to see the same issues in my environment. As best as I can tell, it's not having any effect... but I'd still like to get to the bottom of it.

Thanks

Doug

Douglas Youd, Senior Virtualization Engineer, zettagrid
legisilver
Contributor

I haven't seen the issue happen again in my environment. We are also running Dells: a five-host R610 cluster with 48 GB of RAM apiece and HA turned on. I'll come back if anything changes.

SCCkwongd201110
Contributor

Hey All,

I just started seeing this error on my system as well. I found some info that points to the need to bind the physical NICs to the software iSCSI initiator. You can find this on page 40 of the iSCSI SAN Configuration Guide for ESXi 4.1. Essentially, you have to run the command once for each NIC:

esxcli --server <esxi server name> swiscsi nic add -n <vmkernel name> -d <sw iscsi name>

This makes iSCSI I/O more efficient by moving it from the network layer up to the protocol layer. Hope this works for you.
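For example, assuming vmk1 and vmk2 are your iSCSI VMkernel ports and vmhba33 is the software iSCSI adapter (the names will differ in your environment; esxcfg-vmknic -l and esxcfg-scsidevs -a will show yours):

esxcli --server <esxi server name> swiscsi nic add -n vmk1 -d vmhba33

esxcli --server <esxi server name> swiscsi nic add -n vmk2 -d vmhba33

Then verify the bindings with:

esxcli --server <esxi server name> swiscsi nic list -d vmhba33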

d.

legisilver
Contributor

I believe you sir, but we don't use iSCSI.

legisilver
Contributor

Well, getting these errors again.  I wish someone knew the process well enough to explain what is happening here.

melevy
Contributor

I'm seeing this as well, with a Fibre Channel SAN. Every time the error occurs, we get complaints about slow response.

legisilver
Contributor

Okay guys, here's the deal with my situation and this error.

Basically, we have an XIV backend provisioning volumes to an IBM N6040 (NetApp), which are seen on the NetApp as an aggregate. That aggregate is carved into volumes on the NetApp, and then LUNs are created inside those volumes and presented to our ESXi environment. Once presented to the VMware HBAs, they are turned into datastores with the VMs placed on them.

Okay, this error is the result of a SCSI timeout issue. If you look at your logs you'll probably see timeout errors. Typically, SCSI can come back with some "sense" data, which usually gives a hint as to what the issue is.
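For example, to dig those timeout and sense entries out of the logs yourself (a rough sketch; on ESXi 4.x the vmkernel messages land in /var/log/messages, on classic ESX in /var/log/vmkernel, so adjust the path for your setup):

grep -i "failed" /var/log/messages | grep -i "scsi"

grep -i "sense data" /var/log/messages

And to watch device latency live, run esxtop, press u for the disk-device view, and keep an eye on the DAVG/cmd and KAVG/cmd columns.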

In our case, we weren't getting any sense data. We'd just see the datastores time out. The HBA target doesn't drop and no paths fail; it just times out and doesn't respond.

Again, in OUR case, it turned out to be the NetApp waiting on a storage shelf to process commands for our Exchange environment. Why, you ask? Because on the NetApp everything has to traverse the interconnect which, duh, connects the two modules and makes it a cluster. When our new Exchange environment started receiving a lot of data, it pushed a lot of reads and writes onto that interconnect. The fibre-connected LUNs for the ESXi environment weren't exactly aligned on the NetApp (part of the problem), but this error would occur mostly when the NetApp couldn't respond to the VMware LUNs because it was waiting on the EXN3000 storage unit to fill its interconnect queue.

You see, we bought an EXN3000 storage system (a disk shelf for the NetApp that connects via SAS), and its disks only run at 7,200 RPM. We were pushing way too many read and write ops onto those disks for them to keep up.

As we started loading data onto the Exchange disks, it then started to affect our ESXi environment.

The problem was always there but didn't manifest itself.

Our solution?

Move the VMware environment to straight XIV until we can better load balance the Exchange disk issues.

If anyone wants more details just PM me.

I have a nice set of commands that can help you diagnose this.

BrianHCox
Contributor

Any chance you could share these commands that helped you diagnose the issue? We have a similar problem and it would be good to know if we have a similar backend issue.

Thanks,

Brian
