VMware Cloud Community
ChrisKas10
Enthusiast

Can't back up or copy some of my VM clients

Running ESXi 4 on an HP DL380 G5. The datastore for my VMs is on an internal RAID 1+0 array that had some issues last week. First I lost one drive; after hot-swapping it, two others went to predictive failure. HP determined it was likely a firmware issue and sent me more drives and a backplane just in case... Got all that swapped, the arrays are rebuilt and all the lights are green.

Before the swap I had some issues trying to back up the client VMs that I thought might be related to the hardware challenges. Post swap, I'm afraid I'm still having the same issues. I should mention that the client machines all appear to be functioning fine.

Issue 1 is that one of the VM clients (Windows 2003 R2) logs a lot of events when under heavy disk I/O such as backups (lots of de-duping activity). The verification portion of the backup process tends to fail.

Event Type: Error

Event Source: Disk

Event Category: None

Event ID: 15

Date: 7/11/2010

Time: 8:50:39 PM

User: N/A

Computer: EFDEV

Description:

*The device, \Device\Harddisk0, is not ready for access yet.*

I'll get 15 - 20 of those over the span of a couple hours while the backup job is running.

Issue 2: I can't get a full copy of the vmdk files. I've tried shutting down the machine and using vSphere's datastore browser, FTP and SCP. In all cases the copy of two of the machines' vmdk files eventually stops or times out. I've tried ghettoVCB to a few different NFS targets with the same issue.
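Since every copy method stalls partway through, one way to narrow things down is to read the disk file in fixed-size chunks and note which offsets fail. Below is a minimal Python sketch of that idea. The function name and chunk size are made up for illustration, and it assumes a Python interpreter is available wherever the datastore file can be read, which may not be true on the ESXi host itself.

```python
import os


def scan_for_read_errors(path, chunk_size=1024 * 1024):
    """Read `path` chunk by chunk and return (bytes_read, failed_chunks).

    failed_chunks is a list of (offset, error message) for chunks whose
    read raised an OSError. Hypothetical helper, for illustration only.
    """
    size = os.path.getsize(path)
    failed = []
    bytes_read = 0
    # buffering=0 so Python does not add its own read-ahead buffer
    with open(path, "rb", buffering=0) as f:
        offset = 0
        while offset < size:
            try:
                f.seek(offset)
                data = f.read(min(chunk_size, size - offset))
                bytes_read += len(data)
            except OSError as exc:
                # Record the failing chunk and carry on with the next one
                failed.append((offset, str(exc)))
            offset += chunk_size
    return bytes_read, failed
```

If the failed offsets cluster in the same place on every run, that would point at a specific bad region on the logical drive rather than at the network or the backup target.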

Here's a ghettoVCB snippet from the most recent error:

Cloning disk '/vmfs/volumes/VMs/EFdev-2k3-web2/EFdev-2k3-web2.vmdk'...

*Clone: 67% done.Failed to clone disk : Connection timed out (7208969).*

2010-07-12 20:19:05 -- info: Removing snapshot from EFdev-2k3-web2 ...

ls: /vmfs/volumes/VMs/EFdev-2k3-web2/EFdev-2k3-web2-000001.vmdk: No such file or directory

ls: /vmfs/volumes/VMs/EFdev-2k3-web2/EFdev-2k3-web2-000001-delta.vmdk: No such file or directory

2010-07-12 20:19:21 -- info: Backup Duration: 39.03 Minutes

2010-07-12 20:19:21 -- info: Successfully completed backup for EFdev-2k3-web2!
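Note that the log above ends with "Successfully completed backup" even though the clone failed at 67%, so the final status line alone can't be trusted. A small Python sketch of a stricter check on the log text follows. The function names are hypothetical, and the only failure string it looks for is the one that appears in this snippet.

```python
# Failure text taken from the ghettoVCB output in this post
FAILURE_MARKERS = ("Failed to clone disk",)
SUCCESS_MARKER = "Successfully completed backup"


def find_failures(log_text):
    """Return the log lines that contain any known failure marker."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]


def backup_really_succeeded(log_text):
    """True only if the success line is present and no failure line is."""
    return SUCCESS_MARKER in log_text and not find_failures(log_text)
```

Run against the snippet above, `backup_really_succeeded` returns False because of the "Failed to clone disk" line.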

When that error happens, I find entries like this in /var/log/messages:

Jul 12 20:06:43 vmkernel: 3:20:14:38.244 cpu0:8661)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100050b7780) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.

Jul 12 20:06:43 vmkernel: 3:20:14:38.244 cpu0:8661)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "mpx.vmhba1:C0:T1:L0" state in doubt; requested fast path state update...

Jul 12 20:06:43 vmkernel: 3:20:14:38.244 cpu0:8661)ScsiDeviceIO: 747: Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.

Jul 12 20:06:43 vmkernel: 3:20:14:38.417 cpu0:8661)<4>cciss: cmd 0x4100b1002270 has CHECK CONDITION byte 2 = 0x3

Jul 12 20:06:43 vmkernel: 3:20:14:38.424 cpu0:8661)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x410005142c80) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.

Jul 12 20:06:43 vmkernel: 3:20:14:38.424 cpu0:8661)ScsiDeviceIO: 747: Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.

Jul 12 20:06:44 Hostd: 2010-07-12 20:06:44.761 147CFB90 verbose 'vm:/vmfs/volumes/4b053e54-176e0886-5440-001b784635e0/EFfax/EFfax.vmx' Updating current heartbeatStatus: yellow

Jul 12 20:06:44 vmkernel: 3:20:14:39.588 cpu0:8144)<4>cciss: cmd 0x4100b1002000 has CHECK CONDITION byte 2 = 0x3

Jul 12 20:06:44 vmkernel: 3:20:14:39.588 cpu0:8144)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100050ef780) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.

Jul 12 20:06:44 vmkernel: 3:20:14:39.588 cpu0:8144)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "mpx.vmhba1:C0:T1:L0" state in doubt; requested fast path state update...

Jul 12 20:06:44 vmkernel: 3:20:14:39.588 cpu0:8144)ScsiDeviceIO: 747: Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.

Jul 12 20:06:44 vmkernel: 3:20:14:39.767 cpu0:8675)<4>cciss: cmd 0x4100b1002750 has CHECK CONDITION byte 2 = 0x3

Jul 12 20:06:44 vmkernel: 3:20:14:39.773 cpu0:8675)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100050edb80) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.

Jul 12 20:06:45 Hostd: 2010-07-12 20:06:45.720 75503B90 verbose 'vm:/vmfs/volumes/4b053e54-176e0886-5440-001b784635e0/EFdev/EFdev.vmx' Actual VM overhead: 148062208 bytes

Jul 12 20:06:45 Hostd: 2010-07-12 20:06:45.727 75503B90 verbose 'Vmsvc' RefreshVms updated overhead for 1 VM

Many repetitions.
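To make the repeated entries easier to read, here is a hedged Python sketch that decodes the fields in the ScsiDeviceIO and cciss lines. The opcode, sense key and additional sense code names are standard SCSI values. The host status table reflects my understanding of VMware's host status codes and should be treated as an assumption, as should the rule that the "possible sense data" is only meaningful when the device status is CHECK CONDITION (0x2). The function names are made up for this example.

```python
import re

# Standard SCSI values; only the ones seen in this thread are listed
OPCODES = {0x00: "TEST UNIT READY", 0x28: "READ(10)"}
SENSE_KEYS = {0x0: "NO SENSE", 0x2: "NOT READY", 0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x3A, 0x00): "MEDIUM NOT PRESENT"}
# Assumption: host status names as I understand VMware's documentation
HOST_STATUS = {0x0: "OK", 0x3: "TIMEOUT"}

# Matches the ScsiDeviceIO lines only, so NMP duplicates are not double counted
FAIL_RE = re.compile(
    r'Command (0x[0-9a-f]+) to device "([^"]+)" failed '
    r"H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+) "
    r"Possible sense data: (0x[0-9a-f]+) (0x[0-9a-f]+) (0x[0-9a-f]+)"
)
CCISS_RE = re.compile(
    r"cciss: cmd 0x[0-9a-f]+ has CHECK CONDITION byte 2 = (0x[0-9a-f]+)"
)


def decode_line(line):
    """Decode one ScsiDeviceIO failure line, or return None if no match."""
    m = FAIL_RE.search(line)
    if not m:
        return None
    opcode = int(m.group(1), 16)
    host, dev_status = int(m.group(3), 16), int(m.group(4), 16)
    key, asc, ascq = (int(m.group(i), 16) for i in (6, 7, 8))
    return {
        "device": m.group(2),
        "opcode": OPCODES.get(opcode, hex(opcode)),
        "host_status": HOST_STATUS.get(host, hex(host)),
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "asc": ASC_ASCQ.get((asc, ascq), "%#x/%#x" % (asc, ascq)),
        # Assumption: sense data is only valid with CHECK CONDITION (0x2)
        "sense_valid": dev_status == 0x2,
    }


def decode_cciss(line):
    """Return the sense key name reported by a cciss line, or None."""
    m = CCISS_RE.search(line)
    if not m:
        return None
    key = int(m.group(1), 16)
    return SENSE_KEYS.get(key, hex(key))
```

Under those assumptions, the entries above decode to READ(10) commands against mpx.vmhba1:C0:T1:L0 that end with a host-side timeout, while the cciss driver itself reports sense key 0x3 (MEDIUM ERROR).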

I might theorize I have issues with the vmfs file system?

I'm looking for suggestions on how to proceed. What would you try next to fix this?

Thanks in advance for any suggestions.

5 Replies
DSTAVERT
Immortal

Have a look at the following KB article for starters. http://kb.vmware.com/kb/289902

Do you have the BBWC module installed for the controller and write caching enabled? Firmware up to date for ESXi? Use the specific firmware CD for VMware on the HP download site. Are you using the HP specific version of ESXi? Which build of ESXi?

-- David -- VMware Communities Moderator
ChrisKas10
Enthusiast

Have a look at the following KB article for starters.

Do you have the BBWC module installed for the controller and write caching enabled? Firmware up to date for ESXi? Use the specific firmware CD for VMware on the HP download site. Are you using the HP specific version of ESXi? Which build of ESXi?

I'll check the link again soon. My first attempt left me a bit dazed and confused.

As for the other questions..

1) Sorry, what's a BBWC module, and how do I know if it is installed and write caching is enabled?

2) I just noticed that I'm a few months behind on ESXi updates. I'll be fixing that very soon -- perhaps tonight if I can get an outage scheduled.

3) I can't find a specific firmware CD for VMware on the HP download site. I've very recently done Firmware Updates with the Smart Update Firmware DVD version 9.00, for what that's worth. I'm not entirely sure the NICs flashed correctly...

4) Didn't know there was a HP specific ESXi, nor am I using it (Should I be?)

5) According to the Host Update Utility I'm at VMware ESXi 4.0.0 build-208167

Thanks for the input.

DSTAVERT
Immortal

The BBWC is an HP option for the disk controller that provides a battery-backed RAM cache. Write caching greatly improves disk controller performance and is a must in a virtual environment. To check write caching you need to use the SmartStart CD or the ACU (Array Configuration Utility) CD. Do not enable write caching unless the battery module is installed. I don't remember whether the module shows up in the ACU screens.

The software download page for your particular HP server model lists OS versions for the firmware CD, and VMware is listed. Your firmware CD is version 9, so you should be OK, but there are some critical disk controller updates listed that were released after that CD. I might consider them. I would scan the HP forums for any potential issues with VMware and those updates. Make very sure you use drivers from the VMware OS download page. There are, or have been, firmware versions for some components that are NOT recommended for specific versions of VMware.

There is a specific version of ESXi for HP servers that you can download from HP. There is an upgrade bundle on the same list as the firmware that gives you the components of the HP version of ESXi that the standard version lacks.

You might want to consider upgrading to ESXi Update 2.

-- David -- VMware Communities Moderator
ChrisKas10
Enthusiast

Thanks for the info. I have BBWC. I have current drivers -- including the controller. In fact, being down-level on that is what seems to have started my drive failures in the first place (it didn't dig a hot-swap).

I updated ESXi this morning. Everything seemed to go OK, but now I log the following NEW error at an average of two per second:

Jul 13 14:19:17 vmkernel: 0:00:29:26.036 cpu6:9311)WARNING: vmklinux26: SCSILinuxQueueCommand: cmd 0x00, cmd_len 12 should be 6

As you might guess, there's so much crap filling /var/log/messages that it is difficult to see what's really going on now.
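One way to see past the flood is to filter the repeating warning out before reading the log. A minimal Python sketch, assuming the log has been copied somewhere with a Python interpreter; the function name is made up and the marker string is taken from the warning above.

```python
# Text of the repeating warning quoted above
NOISE_MARKERS = ("SCSILinuxQueueCommand: cmd 0x00, cmd_len 12 should be 6",)


def drop_noise(lines, markers=NOISE_MARKERS):
    """Return (kept_lines, suppressed_count) with noisy lines filtered out."""
    kept, suppressed = [], 0
    for line in lines:
        if any(marker in line for marker in markers):
            suppressed += 1
        else:
            kept.append(line)
    return kept, suppressed
```

Keeping the suppressed count makes it easy to confirm the rate of the warning while still reading the remaining entries.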

I've made sure all the VMware Tools on the clients are updated as well. I'll try a backup now to see if anything changed, but I'm not feeling hopeful.

ChrisKas10
Enthusiast

The SCSILinuxQueueCommand warning turned out to be a non-issue related to the CD-ROM drive.

I've updated to the VMware Broadcom NIC drivers.

When I try backing up to a Windows 2003 NFS share, it goes nice and fast but times out about 5 or 6 minutes in.

When I try backing up to an OpenFiler NFS share it starts OK but after a while I get a LOT of these errors:

Jul 14 01:15:21 vmkernel: 0:00:38:50.989 cpu1:8428)ScsiDeviceIO: 770: Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Jul 14 01:15:22 vmkernel: 0:00:38:51.188 cpu1:8774)<4>cciss: cmd 0x4100b14024e0 has CHECK CONDITION byte 2 = 0x3

Jul 14 01:15:22 vmkernel: 0:00:38:51.188 cpu1:8774)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41000503c900) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.

Jul 14 01:15:22 vmkernel: 0:00:38:51.188 cpu1:8774)ScsiDeviceIO: 770: Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.

Looking closer, "mpx.vmhba1:C0:T1:L0" is the local datastore that holds the VMs. That's not good!

I'm back to square one. I don't doubt there's an issue to be fixed; I just don't know what it is or how to fix it.
