I'm running ESXi 4 on an HP DL380 G5. The datastore for my VMs is on an internal RAID 1+0 array that had some issues last week. First I lost one drive; after hot-swapping it, two others went to predictive failure. HP determined it was likely a firmware issue and sent me more drives and a backplane just in case. I got all of that swapped, the arrays are rebuilt and all the lights are green.
Before the swap I had some issues trying to back up the client VMs, which I thought might be related to the hardware problems. After the swap I'm afraid I'm still having the same issues. I should mention that the client machines all appear to be functioning fine.
Issue 1: one of the VM clients (Windows 2003 R2) logs a lot of disk error events when under heavy disk I/O such as backups (lots of de-duping activity). The verification portion of the backup process tends to fail.
Event Type: Error
Event Source: Disk
Event Category: None
Event ID: 15
Date: 7/11/2010
Time: 8:50:39 PM
User: N/A
Computer: EFDEV
Description:
*The device, \Device\Harddisk0, is not ready for access yet.*
I'll get 15 to 20 of those over the span of a couple of hours while the backup job is running.
Issue 2: I can't get a full copy of the vmdk files. I've tried shutting down the machine and using vSphere's datastore browser, FTP and SCP. In all cases the copy of two of the machines' vmdk files eventually stops or times out. I've tried ghettoVCB to a few different NFS targets with the same result.
Here's a ghettoVCB snippet from the most recent error:
Cloning disk '/vmfs/volumes/VMs/EFdev-2k3-web2/EFdev-2k3-web2.vmdk'...
*Clone: 67% done.Failed to clone disk : Connection timed out (7208969).*
2010-07-12 20:19:05 -- info: Removing snapshot from EFdev-2k3-web2 ...
ls: /vmfs/volumes/VMs/EFdev-2k3-web2/EFdev-2k3-web2-000001.vmdk: No such file or directory
ls: /vmfs/volumes/VMs/EFdev-2k3-web2/EFdev-2k3-web2-000001-delta.vmdk: No such file or directory
2010-07-12 20:19:21 -- info: Backup Duration: 39.03 Minutes
2010-07-12 20:19:21 -- info: Successfully completed backup for EFdev-2k3-web2!
When that error happens, I find entries like this in /var/log/messages:
Jul 12 20:06:43 vmkernel: 3:20:14:38.244 cpu0:8661)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100050b7780) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Jul 12 20:06:43 vmkernel: 3:20:14:38.244 cpu0:8661)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "mpx.vmhba1:C0:T1:L0" state in doubt; requested fast path state update...
Jul 12 20:06:43 vmkernel: 3:20:14:38.244 cpu0:8661)ScsiDeviceIO: 747: Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Jul 12 20:06:43 vmkernel: 3:20:14:38.417 cpu0:8661)<4>cciss: cmd 0x4100b1002270 has CHECK CONDITION byte 2 = 0x3
Jul 12 20:06:43 vmkernel: 3:20:14:38.424 cpu0:8661)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x410005142c80) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Jul 12 20:06:43 vmkernel: 3:20:14:38.424 cpu0:8661)ScsiDeviceIO: 747: Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Jul 12 20:06:44 Hostd: 2010-07-12 20:06:44.761 147CFB90 verbose 'vm:/vmfs/volumes/4b053e54-176e0886-5440-001b784635e0/EFfax/EFfax.vmx' Updating current heartbeatStatus: yellow
Jul 12 20:06:44 vmkernel: 3:20:14:39.588 cpu0:8144)<4>cciss: cmd 0x4100b1002000 has CHECK CONDITION byte 2 = 0x3
Jul 12 20:06:44 vmkernel: 3:20:14:39.588 cpu0:8144)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100050ef780) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Jul 12 20:06:44 vmkernel: 3:20:14:39.588 cpu0:8144)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "mpx.vmhba1:C0:T1:L0" state in doubt; requested fast path state update...
Jul 12 20:06:44 vmkernel: 3:20:14:39.588 cpu0:8144)ScsiDeviceIO: 747: Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Jul 12 20:06:44 vmkernel: 3:20:14:39.767 cpu0:8675)<4>cciss: cmd 0x4100b1002750 has CHECK CONDITION byte 2 = 0x3
Jul 12 20:06:44 vmkernel: 3:20:14:39.773 cpu0:8675)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100050edb80) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Jul 12 20:06:45 Hostd: 2010-07-12 20:06:45.720 75503B90 verbose 'vm:/vmfs/volumes/4b053e54-176e0886-5440-001b784635e0/EFdev/EFdev.vmx' Actual VM overhead: 148062208 bytes
Jul 12 20:06:45 Hostd: 2010-07-12 20:06:45.727 75503B90 verbose 'Vmsvc' RefreshVms updated overhead for 1 VM
These entries repeat many times.
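To make the repetition easier to quantify, here is a minimal Python sketch that decodes the ScsiDeviceIO failure lines and counts them per device. The function names (`decode_failure`, `count_failures`) are my own. The opcode and sense-key names come from the SCSI standards, but the meaning I give to host status 0x3 (timeout) is my reading of VMware's host-status table, so treat it as an assumption. The log itself labels the sense data "Possible", so the decoded sense key may not be meaningful when the device status is 0x0.

```python
import re
from collections import Counter

# Small, non-exhaustive lookup tables. Opcodes and sense keys follow the
# SCSI standards; HOST_STATUS is an assumption based on VMware's published
# host-status table and should be double-checked.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
SENSE_KEYS = {0x2: "NOT READY", 0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR"}
HOST_STATUS = {0x0: "OK", 0x3: "TIMEOUT"}

HEX = r"(0x[0-9a-fA-F]+)"
# Matches only the ScsiDeviceIO failure lines, so the matching NMP lines
# for the same command are not counted twice.
FAILURE = re.compile(
    r'ScsiDeviceIO: \d+: Command ' + HEX + r' to device "([^"]+)" failed '
    r'H:' + HEX + r' D:' + HEX + r' P:' + HEX +
    r' Possible sense data: ' + HEX + ' ' + HEX + ' ' + HEX
)


def decode_failure(line):
    """Return a dict describing one ScsiDeviceIO failure, or None."""
    match = FAILURE.search(line)
    if match is None:
        return None
    opcode, device, host, dev, plugin, key, asc, ascq = match.groups()
    opcode, host, dev, plugin, key, asc, ascq = (
        int(v, 16) for v in (opcode, host, dev, plugin, key, asc, ascq)
    )
    return {
        "device": device,
        "command": OPCODES.get(opcode, hex(opcode)),
        "host_status": HOST_STATUS.get(host, hex(host)),
        "device_status": dev,
        "plugin_status": plugin,
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "asc": asc,
        "ascq": ascq,
    }


def count_failures(lines):
    """Count decoded failures per (device, command, host status)."""
    counts = Counter()
    for line in lines:
        info = decode_failure(line)
        if info is not None:
            counts[(info["device"], info["command"], info["host_status"])] += 1
    return counts
```

Feeding the lines of a copy of /var/log/messages to `count_failures` gives one count per device, command and host status, which should show whether every failure is a read against the same logical drive.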
My theory is that I have a problem with the VMFS file system, but I'm not sure.
I'm looking for suggestions on how to proceed. What would you try next to fix this?
Thanks in advance for any suggestions.