We are running a vSphere 4 infrastructure on 14 IBM HS22 blade servers. Each server has 96 GB of physical RAM.
The SAN is an IBM DS8000 and the HBAs are QLogic 8 Gb dual-port cards.
The problem we have been facing is that VMs on only one of the servers are experiencing very slow performance.
It also takes a long time to vMotion any VM from this host to any other host.
All other Blade servers are performing normally.
To isolate the problem, we created a VM on the local hard disk, and its performance was satisfactory.
VMs respond very slowly only when their files are on the SAN.
We are currently running only one VM on this server and have recreated it several times with different operating systems.
We have replaced the HBA three times and moved this host to a different blade slot, but in vain.
Can anyone help?
I would try one thing: create a VM on this host, vMotion it to another host, and observe the performance. If it performs normally there, the problem is with this host.
You can also vMotion a VM onto this host and check its performance. Please describe what you mean by the VMs being very slow.
Check the disk I/O of the VMs on that host, and check the disk latency of each datastore from that host. Are they high?
Potential saturation points could be the HBA or the FC switch port; check the stats on each. It's a bit confusing, but also check the latency from the controller on your array to try to pin down the location of any bottleneck.
I am sorry for the late reply; I was away.
Yes, disk latency is very high for vmhba1, spiking from 100 to 300, and for one LUN it spiked to a value of 3000.
We checked the vmkernel logs and found the following error messages repeating constantly:
Oct 10 10:01:04 vmwarehost11 vmkernel: 3:17:26:26.256 cpu1:4519)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.60050768018e82d7180000000000005d" state in doubt; requested fast path state update...
Oct 10 10:01:04 vmwarehost11 vmkernel: 3:17:26:26.256 cpu1:4519)ScsiDeviceIO: 747: Command 0x28 to device "naa.60050768018e82d7180000000000005d" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
We looked up the SCSI code H:0x2 D:0x0 P:0x0 in the VMware KB article http://kb.vmware.com/kb/1029039 and found that "This status is returned when the HBA driver is unable to issue a command to the device. This status can occur due to dropped FCP frames in the environment."
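If it helps anyone chasing the same symptom, here is a minimal Python sketch for tallying these ScsiDeviceIO failures per device and status code, so you can see whether the errors are confined to one LUN or spread across all of them. It only assumes the line format quoted above; the function name `tally_failures` and the sample list are made up for illustration, and the log file location on your host is something you would need to confirm yourself.

```python
import re
from collections import Counter

# Matches ScsiDeviceIO failure lines in the format quoted in this thread.
# The pattern is an assumption based only on those two sample lines.
FAILURE_RE = re.compile(
    r'Command (0x[0-9a-fA-F]+) to device "([^"]+)" failed '
    r'H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)'
)

def tally_failures(lines):
    """Count failures keyed by (device, command, host, device, plugin status)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            cmd, device, host, dev, plugin = match.groups()
            counts[(device, cmd, host, dev, plugin)] += 1
    return counts

# The two log lines from the post, used here as sample input.
SAMPLE = [
    'Oct 10 10:01:04 vmwarehost11 vmkernel: 3:17:26:26.256 cpu1:4519)WARNING: NMP: '
    'nmp_DeviceRequestFastDeviceProbe: NMP device '
    '"naa.60050768018e82d7180000000000005d" state in doubt; '
    'requested fast path state update...',
    'Oct 10 10:01:04 vmwarehost11 vmkernel: 3:17:26:26.256 cpu1:4519)ScsiDeviceIO: 747: '
    'Command 0x28 to device "naa.60050768018e82d7180000000000005d" failed '
    'H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.',
]

counts = tally_failures(SAMPLE)
for (device, cmd, host, dev, plugin), n in counts.most_common():
    print(f"{device} cmd={cmd} H:{host} D:{dev} P:{plugin} count={n}")
```

To run it against a real log, replace `SAMPLE` with the lines of a copy of the vmkernel log, for example `open(path).readlines()`. Command 0x28 is a SCSI READ(10), so a tally dominated by that opcode with H:0x2 would point at reads failing on the host side rather than at the guest OS.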
As mentioned earlier, we have replaced the HBA card three times and reinstalled ESX once, without any result.
Have you verified that the HBA is on the VMware HCL and is at the correct firmware and driver versions, if applicable? Also verify that the array and its firmware level are on the VMware HCL.
It might be a dodgy switch port or switch. Check the switch for any errors. Can you do an fcping to the array? Try changing ports end to end, and try changing the cables.