VMware Cloud Community
vinhdat82
Contributor

Host not responding when IO load is high

Hi everyone,

I have a few ESXi 5.0 hosts connected to a NAS over iSCSI.

The server definitely has enough physical memory for all VMs to use all of their allocated memory.

When the I/O load is somewhat high (around 100 ms latency) and the host has any swap in use (as little as ~2 MB), the host tends to stop responding:

- Some of the VMs hang; others are fine.

- I can no longer telnet to port 443 on the host.

- Running ls /vmfs/volumes on the host hangs.
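
For anyone who wants to script the port 443 check instead of running telnet by hand, here is a minimal Python sketch. It assumes it is run from a workstation that can reach the management network. The function name port_is_open and the host name in the usage comment are hypothetical and not taken from my setup. It only tells you whether a TCP connection can be opened; it says nothing about why hostd is not answering.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (hypothetical host name):
#   port_is_open("esxi-host.example", 443)
```

A short timeout matters here because a hung host may silently drop packets rather than refuse the connection.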

The remaining hosts have no swap in use and show no problem.

I have tried the following steps:

- /etc/init.d/hostd restart

- /etc/init.d/vpxa restart

- /etc/init.d/wsman restart

- esxcfg-rescan vmhba35 (my iSCSI adapter), which failed with "Error: Unable to scan VMkernel SCSI subsystem for old devices.  Scan already in progress"

A reboot solves the problem, but I don't want to reboot.

I don't have direct access to the DCUI.

Any help is appreciated.

Thanks so much.

RHCE, VCI
6 Replies
vinhdat82
Contributor

BTW, storage I/O and management I/O are on different vdSwitches.

RHCE, VCI
Sreejesh_D
Virtuoso

Looks like it's a bug in ESXi 5.0.

Have a look at this blog post:

http://vmtoday.com/2012/02/vsphere-5-networking-bug-affects-software-iscsi/

vinhdat82
Contributor

I found that server 2, running v5.0.0 Update 1, is fine.

It was able to recover the iSCSI session.

Server 1, running v5.0.0 (GA), couldn't recover the iSCSI session, so it failed.

I have updated server 1 to v5.0.0 Update 2.

I will know within a day whether Update 2 mitigates the problem.

RHCE, VCI
vinhdat82
Contributor

After an I/O latency spike, the host lost its management connection again (not responding).

The remaining hosts recovered access within 1 second and survived.

Any suggestions on how to recover the iSCSI connection and bring management back up over SSH?

Thanks a million.

RHCE, VCI
vinhdat82
Contributor

The failed host uses an Intel NIC.

The surviving host uses a Broadcom NIC.

1) How can I recover the iSCSI connection?

2) How can I get management up again?

RHCE, VCI