Re: URGENT partition error, 3 VMs in production with VM Server ESX 3.5

aboland · ‎01-28-2008

This error appears at every server start, in /etc/vmware/hostd.log:

2008-01-28 17:38:01.740 'Vmsvc' 3076461472 info VMServices Plugin initialized

2008-01-28 17:38:01.740 'BlklistsvcPlugin' 3076461472 info Block List Service Plugin started

2008-01-28 17:38:01.740 'DirectorysvcPlugin' 3076461472 info Plugin started

2008-01-28 17:38:01.741 'ha-host' 3076461472 info About:(vim.AboutInfo) {

dynamicType = <unset>,

name = "VMware ESX Server",

fullName = "VMware ESX Server build-64607",

vendor = "VMware, Inc.",

version = "3.5.0",

build = "64607",

localeVersion = <unset>,

localeBuild = <unset>,

osType = "vmnix-x86",

productLineId = "esx",

apiType = "HostAgent",

apiVersion = "2.5.0",

}

2008-01-28 17:38:01.742 'ha-host' 3076461472 info Local swap datastore set to:

2008-01-28 17:38:01.742 'ha-host' 3076461472 info Local swap datastore not found

2008-01-28 17:38:01.775 'Partitionsvc' 3076461472 info InvokePartedUtil /sbin/partedUtil

2008-01-28 17:38:01.902 'Partitionsvc' 3076461472 warning Unable to get partition information for /vmfs/devices/disks/vmhba1:0:0:0

2008-01-28 17:38:01.902 'Partitionsvc' 3076461472 warning Status : 255

Output:
Error : Error: The partition table on /vmfs/devices/disks/vmhba1:0:0:0 is inconsistent. There are many reasons why this might be the case. However, the most likely reason is that Linux detected the BIOS geometry for /vmfs/devices/disks/vmhba1:0:0:0 incorrectly. GNU Parted suspects the real geometry should be 283638/64/32 (not 36158/255/63). You should check with your BIOS first, as this may not be correct. You can inform Linux by adding the parameter /devices/disks/vmhba1:0:0:0=283638,64,32 to the command line. See the LILO or GRUB documentation for more information. If you think Parted's suggested geometry is correct, you may select Ignore to continue (and fix Linux later). Otherwise, select Cancel (and fix Linux and/or the BIOS now).

Error: Can't have a partition outside the disk!

Unable to read partition table for device /vmfs/devices/disks/vmhba1:0:0:0

2008-01-28 17:38:01.926 'HostsvcPlugin' 3076461472 error Unable to obtain active diagnostic partition information:

2008-01-28 17:38:01.926 'HostsvcPlugin' 3076461472 info Plugin started
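To pull only the warnings and errors out of a hostd.log excerpt like the one above, a small filter can help. This is a minimal sketch, assuming the line layout visible in the excerpt (timestamp, quoted component, thread id, level, message). The function name `problems` and the regular expression are illustrative and not part of any VMware tool.

```python
import re

# Line layout assumed from the excerpt above:
#   <date> <time> '<component>' <thread id> <level> <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"'(?P<component>[^']+)' (?P<thread>\d+) "
    r"(?P<level>info|warning|error) (?P<msg>.*)$"
)

def problems(lines):
    """Return (timestamp, component, level, message) for warnings and errors."""
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("level") in ("warning", "error"):
            found.append((m.group("ts"), m.group("component"),
                          m.group("level"), m.group("msg")))
    return found

# Three lines copied from the excerpt above.
sample = [
    "2008-01-28 17:38:01.775 'Partitionsvc' 3076461472 info InvokePartedUtil /sbin/partedUtil",
    "2008-01-28 17:38:01.902 'Partitionsvc' 3076461472 warning Unable to get partition information for /vmfs/devices/disks/vmhba1:0:0:0",
    "2008-01-28 17:38:01.926 'HostsvcPlugin' 3076461472 error Unable to obtain active diagnostic partition information:",
]

for entry in problems(sample):
    print(entry)
```

Continuation lines without a timestamp (such as the Parted output) are skipped by this sketch, so it only gives a quick index of where to look.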

I have no idea how to solve this. On top of that, the VMs run well for 3 or 4 days, and after that this error appears on the console monitor:

cpu0:1024)VMNIX: <0>scsi: device set offline command error recovery failed: Host 1 Channel 0 ID 0 LUN 0

Command error recovery failed: <0> journal commit I/O error

After that, this error appears on each Linux console of the VMware host (ALT-F1 to F6):

init: cannot execute "/sbin/mingetty"

I/O error: dev 08:02, sector 5777360

INIT: ID "1" respawning too fast: disabled for 5 minutes
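The `dev 08:02` in the I/O error identifies the block device by major and minor number. The sketch below decodes it, assuming the older Linux kernel convention of printing both numbers in hexadecimal and the standard SCSI-disk numbering (major 8, 16 minors per disk). The helper names are hypothetical.

```python
import re

def decode_io_error(line):
    """Split an 'I/O error: dev MM:mm, sector N' line into three integers.

    Assumption: major and minor are printed in hexadecimal.
    """
    m = re.search(
        r"I/O error: dev ([0-9a-fA-F]{2}):([0-9a-fA-F]{2}), sector (\d+)", line
    )
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16), int(m.group(3))

def sd_name(major, minor):
    """Map a SCSI-disk device number to a /dev/sdXN name (major 8 only)."""
    if major != 8:
        return None
    disk = chr(ord("a") + minor // 16)   # 16 minors per disk
    part = minor % 16                    # 0 means the whole disk
    return f"/dev/sd{disk}{part if part else ''}"

major, minor, sector = decode_io_error("I/O error: dev 08:02, sector 5777360")
print(sd_name(major, minor), sector)     # /dev/sda2 5777360
```

Under those assumptions the failing device is /dev/sda2, which the df output later in this thread shows mounted as the service console root filesystem.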

Please help; I can't find any solution in the manuals or on the web.

Server Hardware:

IBM x3650

RAID 5: 6 SAS disks, 500 GB, 7200 RPM each

2 Xeon (4 core)

9 GB of RAM

Thanks in advance

mcowger · ‎01-28-2008

Sounds like you might have a bad system disk...

--Matt

--Matt VCDX #52 blog.cowger.us

aboland · ‎01-29-2008

The 6 disks in the RAID are fine; the IBM server doesn't report any errors.

Chamon · ‎01-29-2008

So the ESX os and the storage for the VMs are all on the same RAID?

aboland · ‎01-29-2008

Yes, in the same RAID 5.

Is there any problem with that? Is it bad?

For the first ten days the 3 VMs (2 Linux and 1 Windows 2003 R2) ran fine.

The 3 VMs in detail:

1) Linux RHEL 4 with Lotus Domino and CA Backup Agent for domino.

2) Linux RHEL 5 with MySQL and Apache, with a variety of PHP systems

3) Windows Server 2003 R2 (with AD, DNS & DHCP for the internal network, plus File Server and the AutoCAD license service)

Chamon · ‎01-29-2008

How much space is left on the vmfs partition?

Chamon · ‎01-29-2008

Can you log into the SC or does it even fully boot up?

aboland · ‎01-29-2008

More data from console:

/root> vdf -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             4.9G  1.3G  3.4G  27% /
/dev/sda1              97M   26M   67M  28% /boot
none                  132M     0  132M   0% /dev/shm
/dev/sda6             2.0G   38M  1.8G   3% /var/log
/vmfs/devices         4.3T     0  4.3T   0% /vmfs/devices
/vmfs/volumes/47629ba9-f8e410a6-1ac3-001a64635138
                      2.0T  1.2T  782G  61% /vmfs/volumes/storage1

/root> df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             4.9G  1.3G  3.4G  27% /
/dev/sda1              97M   26M   67M  28% /boot
none                  132M     0  132M   0% /dev/shm
/dev/sda6             2.0G   38M  1.8G   3% /var/log

Here is the error reproduced manually (the same one appears at boot time in /etc/vmware/hostd.log; see the first message above):

/root> partedUtil get /vmfs/devices/disks/vmhba1:0:0:0
Error: The partition table on /vmfs/devices/disks/vmhba1:0:0:0 is inconsistent. There are many reasons why this might be the case. However, the most likely reason is that Linux detected the BIOS geometry for /vmfs/devices/disks/vmhba1:0:0:0 incorrectly. GNU Parted suspects the real geometry should be 283638/64/32 (not 36158/255/63). You should check with your BIOS first, as this may not be correct. You can inform Linux by adding the parameter /devices/disks/vmhba1:0:0:0=283638,64,32 to the command line. See the LILO or GRUB documentation for more information. If you think Parted's suggested geometry is correct, you may select Ignore to continue (and fix Linux later). Otherwise, select Cancel (and fix Linux and/or the BIOS now).
Error: Can't have a partition outside the disk!
Unable to read partition table for device /vmfs/devices/disks/vmhba1:0:0:0

Here are the devices in /vmfs/devices/disks:

/vmfs/devices/disks> ls -la
total 2440635329
drwxr-xr-x 1 root root           512 Jan 29 18:49 .
drwxr-xr-x 1 root root           512 Jan 29 18:49 ..
-rw------- 1 root root 2496439255040 Jan 29 18:49 vmhba1:0:0:0
-rw------- 1 root root     104841216 Jan 29 18:49 vmhba1:0:0:1
-rw------- 1 root root    5242880000 Jan 29 18:49 vmhba1:0:0:2
-rw------- 1 root root 2190902034432 Jan 29 18:49 vmhba1:0:0:3
-rw------- 1 root root    2772434944 Jan 29 18:49 vmhba1:0:0:4
-rw------- 1 root root     570408960 Jan 29 18:49 vmhba1:0:0:5
-rw------- 1 root root    2097135616 Jan 29 18:49 vmhba1:0:0:6
-rw------- 1 root root     104841216 Jan 29 18:49 vmhba1:0:0:7
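The numbers quoted so far can be checked against each other with a little arithmetic. The sketch below assumes 512-byte sectors and compares the size of vmhba1:0:0:0 from the listing with the two geometries named in the Parted error. Reading the result as a 32-bit sector-count wraparound is only an interpretation of the arithmetic, not something confirmed anywhere in this thread.

```python
SECTOR_BYTES = 512                  # assumed sector size

disk_bytes = 2496439255040          # size of vmhba1:0:0:0 from the ls output
suggested = (283638, 64, 32)        # geometry Parted suggests (C/H/S)
detected = (36158, 255, 63)         # geometry Parted rejects (C/H/S)

def chs_sectors(chs):
    """Number of sectors addressable with a cylinders/heads/sectors triple."""
    c, h, s = chs
    return c * h * s

total_sectors = disk_bytes // SECTOR_BYTES
wrapped = total_sectors % 2**32     # what a 32-bit sector counter would hold

print("total sectors      :", total_sectors)
print("exceeds 32 bits    :", total_sectors > 2**32 - 1)
print("wrapped (mod 2^32) :", wrapped)
print("suggested C*H*S    :", chs_sectors(suggested))
print("detected  C*H*S    :", chs_sectors(detected))
```

With these inputs the suggested geometry multiplies out to exactly the wrapped value, while the real sector count needs more than 32 bits. That would be consistent with a tool seeing partitions that end "outside the disk", but treat it as a hypothesis to verify.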

And here is the same command run on each partition, vmhba1:0:0:1 through vmhba1:0:0:7:

/vmfs/devices/disks> partedUtil get /vmfs/devices/disks/vmhba1:0:0:1
99 64 32 204768
1 0 204767 0 0

/vmfs/devices/disks> partedUtil get /vmfs/devices/disks/vmhba1:0:0:2
637 255 63 10240000
1 0 10239999 0 0

/vmfs/devices/disks> partedUtil get /vmfs/devices/disks/vmhba1:0:0:3
266362 255 63 4279105536

/vmfs/devices/disks> partedUtil get /vmfs/devices/disks/vmhba1:0:0:4
Warning: Unable to align partition properly. This probably means that another partitioning tool generated an incorrect partition table, because it didn't have the correct BIOS geometry. It is usually safe to ignore. Hitting cancel will exit the installer.
Error: No such file or directory during read on /vmfs/devices/disks/vmhba1:0:0:4
Unable to read partition table for device /vmfs/devices/disks/vmhba1:0:0:4

/vmfs/devices/disks> partedUtil get /vmfs/devices/disks/vmhba1:0:0:5
543 64 32 1114080
1 0 1114079 0 0

/vmfs/devices/disks> partedUtil get /vmfs/devices/disks/vmhba1:0:0:6
999 128 32 4095968
1 0 4095967 0 0

/vmfs/devices/disks> partedUtil get /vmfs/devices/disks/vmhba1:0:0:7
99 64 32 204768
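As a sanity check, the per-partition partedUtil output can be compared with the byte sizes from the ls listing. This is a rough sketch, assuming 512-byte sectors and assuming that the first output line reads "cylinders heads sectors-per-track total-sectors"; that field layout is inferred from the numbers here, not taken from documentation.

```python
SECTOR_BYTES = 512  # assumed sector size

# Byte sizes from the ls -la listing, keyed by partition number.
ls_bytes = {1: 104841216, 2: 5242880000, 3: 2190902034432,
            5: 570408960, 6: 2097135616, 7: 104841216}

# First line of each successful `partedUtil get`, read here as
# "cylinders heads sectors-per-track total-sectors" (an assumption).
parted_first_line = {
    1: "99 64 32 204768",
    2: "637 255 63 10240000",
    3: "266362 255 63 4279105536",
    5: "543 64 32 1114080",
    6: "999 128 32 4095968",
    7: "99 64 32 204768",
}

def total_sectors(line):
    """Take the fourth field of the geometry line as the sector count."""
    return int(line.split()[3])

for part, line in sorted(parted_first_line.items()):
    sectors = total_sectors(line)
    ok = sectors * SECTOR_BYTES == ls_bytes[part]
    print(part, sectors, "matches ls" if ok else "MISMATCH")
```

Every partition that partedUtil can read agrees with the listing under these assumptions, so the individual partitions look self-consistent and only the whole-disk device vmhba1:0:0:0 fails.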

If you need more info or logs, please tell me.

Thanks

aboland · ‎01-29-2008

This message appears on the monitor connected to the IBM server (i.e. on the console):

VMNIX: <0>scsi: device set offline - command error recovery failed: Host 1 Channel 0 ID 0 LUN 0 command error recovery failed: <0> journal commit I/O error

When it does, all the mingetty processes are down and I can't log in at the console.

Earlier, when the system starts to become unstable, a few hours before (24 or 48), the only repeated message is:

I/O error: dev 08:02, sector NNNNNNN

on the console monitor and in the dmesg output.

After a reboot (shutting down the VMs first if they are running), 4 or 5 days later the system again reports the I/O error, but with a completely different sector NNNNNNN.

Later comes the VMNIX error transcribed above, then the system crashes, and sometimes the VMs too.

The IBM BIOS diagnostic check passes on every disk in the RAID. The disks are 500 GB SAS (six disks in RAID 5, 3 TB total capacity).

I'm very very

Chamon · ‎01-29-2008

Could it be that the disk was not aligned when you partitioned it? When you partition the volume from the VC it is supposed to align the disks, but I have heard that this does not always work. In one of the guides (and in class) we had to manually align the disks as follows. I got this from the service console guide found here:

http://www.rtfm-ed.co.uk/docs/vmwdocs/ESX3.x-VC2.x-ServiceConsole-Guide.pdf

There are two cases when aligning disk partitions:

When aligning raw disks or Raw Device Mapping (RDM) volumes, the alignment is done at the Virtual Machine (VM) level. For example, on Windows VMs use diskpar to perform the alignment.

To align VMFS volumes, the alignment will be done at ESX server level using fdisk and at the VM level. This is because both the ESX Server and the clients will put MBRs on the LUNs. The ESX must align the VMFS volume, and the client systems must align their virtual disks.

To align the ESX server:

1. On the service console, execute fdisk /dev/sd<x>, where /dev/sd<x> is the device on which you would like to create the VMFS.

2. Type "n" to create a new partition.

3. Type "p" to create a primary partition.

4. Type "1" to create partition #1.

5. Select the defaults to use the complete disk.

6. Type "x" to get into expert mode.

7. Type "b" to specify the starting block for partitions.

8. Type "1" to select partition #1.

9. Type "128" to align partition #1 on a 64KB boundary.

10. Type "r" to return to the main menu.

11. Type "t" to change partition type.

12. Type "1" to select partition #1.

13. Type "fb" to set the type to fb (VMFS volume).

14. Type "w" to write label and the partition information to disk.

By declaring the partition type as fb, the ESX server will recognize the partition as an unformatted VMFS volume. You should be able to put a VMFS file system on it using the MUI or vmkfstools. Next, the virtual disks for each VM must be aligned. For Linux VMs follow the procedure listed above. For Windows VMs, use the procedure for Windows, above.
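The "128" in step 9 follows from simple arithmetic. A minimal sketch, assuming 512-byte sectors (the guide does not state the sector size, and the helper names are illustrative):

```python
SECTOR_BYTES = 512          # assumed sector size
BOUNDARY = 64 * 1024        # the 64KB boundary named in step 9

def start_offset(start_sector):
    """Byte offset at which a partition starting at this sector begins."""
    return start_sector * SECTOR_BYTES

def is_aligned(start_sector, boundary=BOUNDARY):
    """True if the partition start falls exactly on the boundary."""
    return start_offset(start_sector) % boundary == 0

print(start_offset(128), is_aligned(128))   # starting block from step 9
print(start_offset(63), is_aligned(63))     # 63 is a common legacy default start
```

A start at sector 128 lands at byte 65536, exactly on the boundary, while the traditional start at sector 63 lands at byte 32256 and does not.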

mcowger · ‎01-29-2008

Disk alignment won't cause errors like this.

Regardless of what the IBM array manager says, you have a failing connection or controller.

--Matt

Chamon · ‎01-29-2008

A bad "connection"? Would reseating the drives help? Is that what you mean?

mcowger · ‎01-29-2008

Depends on how they are connected - sounds like these are internal, so I would be looking at the controller for problems.

I'd be running HW diags at this point.

--Matt

aboland · ‎02-11-2008

I tested on another server with exactly the same configuration (6 disks, RAID 5), only with less memory (5 GB).

I installed VMware from scratch and the error is the same. The VMware warning:

2008-02-11 04:37:33.746 'Partitionsvc' 61758384 warning Unable to get partition information for /vmfs/devices/disks/vmhba1:0:0:0

I tried to delete the storage from the client (see the error below)!

2008-02-11 04:54:17.639 'HostsvcPlugin' 14453680 info DeletePartition: Retrieving disk partition info failed : vim.fault.PlatformConfigFault

2008-02-11 04:54:17.640 'Hostsvc::DatastoreSystem' 14453680 error DestroyVmfsDatastore: can't delete partition 3 on lun vmhba1:0:0

On the production server the same thing occurred when I tried to change the block size in order to create 500 GB VM disks. In the end I had to do it by hand, with commands from the console.

THIS IS A SERIOUS PROBLEM FOR MY ENTERPRISE at this moment

The complete log is:

2008-02-11 04:37:33.614 'Partitionsvc' 61758384 info InvokePartedUtil /sbin/partedUtil

2008-02-11 04:37:33.746 'Partitionsvc' 61758384 warning Unable to get partition information for /vmfs/devices/disks/vmhba1:0:0:0

2008-02-11 04:37:33.746 'Partitionsvc' 61758384 warning Status : 255

Output:

Error : Error: The partition table on /vmfs/devices/disks/vmhba1:0:0:0 is inconsistent. There are many reasons why this might be the case. However, the most likely reason is that Linux detected the BIOS geometry for /vmfs/devices/disks/vmhba1:0:0:0 incorrectly. GNU Parted suspects the real geometry should be 286838/64/32 (not 36566/255/63). You should check with your BIOS first, as this may not be correct. You can inform Linux by adding the parameter /devices/disks/vmhba1:0:0:0=286838,64,32 to the command line. See the LILO or GRUB documentation for more information. If you think Parted's suggested geometry is correct, you may select Ignore to continue (and fix Linux later). Otherwise, select Cancel (and fix Linux and/or the BIOS now).

Error: Can't have a partition outside the disk!

Unable to read partition table for device /vmfs/devices/disks/vmhba1:0:0:0

2008-02-11 04:38:31.140 'Partitionsvc' 26045360 info InvokePartedUtil /sbin/partedUtil

2008-02-11 04:38:31.270 'Partitionsvc' 26045360 warning Unable to get partition information for /vmfs/devices/disks/vmhba1:0:0:0

2008-02-11 04:38:31.271 'Partitionsvc' 26045360 warning Status : 255

Output:

Error : Error: The partition table on /vmfs/devices/disks/vmhba1:0:0:0 is inconsistent. There are many reasons why this might be the case. However, the most likely reason is that Linux detected the BIOS geometry for /vmfs/devices/disks/vmhba1:0:0:0 incorrectly. GNU Parted suspects the real geometry should be 286838/64/32 (not 36566/255/63). You should check with your BIOS first, as this may not be correct. You can inform Linux by adding the parameter /devices/disks/vmhba1:0:0:0=286838,64,32 to the command line. See the LILO or GRUB documentation for more information. If you think Parted's suggested geometry is correct, you may select Ignore to continue (and fix Linux later). Otherwise, select Cancel (and fix Linux and/or the BIOS now).

Error: Can't have a partition outside the disk!

Unable to read partition table for device /vmfs/devices/disks/vmhba1:0:0:0

2008-02-11 04:52:14.952 'TaskManager' 19020720 info Task Created : haTask-ha-host-vim.host.DatastoreSystem.removeDatastore-27

2008-02-11 04:52:14.953 'Partitionsvc' 19020720 info InvokePartedUtil /sbin/partedUtil

2008-02-11 04:52:15.102 'Partitionsvc' 19020720 warning Unable to get partition information for /vmfs/devices/disks/vmhba1:0:0:0

2008-02-11 04:52:15.102 'Partitionsvc' 19020720 warning Status : 255

Output:

Error : Error: The partition table on /vmfs/devices/disks/vmhba1:0:0:0 is inconsistent. There are many reasons why this might be the case. However, the most likely reason is that Linux detected the BIOS geometry for /vmfs/devices/disks/vmhba1:0:0:0 incorrectly. GNU Parted suspects the real geometry should be 286838/64/32 (not 36566/255/63). You should check with your BIOS first, as this may not be correct. You can inform Linux by adding the parameter /devices/disks/vmhba1:0:0:0=286838,64,32 to the command line. See the LILO or GRUB documentation for more information. If you think Parted's suggested geometry is correct, you may select Ignore to continue (and fix Linux later). Otherwise, select Cancel (and fix Linux and/or the BIOS now).

Error: Can't have a partition outside the disk!

Unable to read partition table for device /vmfs/devices/disks/vmhba1:0:0:0

2008-02-11 04:52:15.102 'HostsvcPlugin' 19020720 info DeletePartition: Retrieving disk partition info failed : vim.fault.PlatformConfigFault

2008-02-11 04:52:15.102 'Hostsvc::DatastoreSystem' 19020720 error DestroyVmfsDatastore: can't delete partition 3 on lun vmhba1:0:0

2008-02-11 04:52:15.102 'Hostsvc::DatastoreSystem' 19020720 warning RemoveDatastore: Failed to remove backend for datastore vmware2:storage1.

2008-02-11 04:52:15.128 'TaskManager' 19020720 info Task Completed : haTask-ha-host-vim.host.DatastoreSystem.removeDatastore-27

2008-02-11 04:52:15.129 'Vmomi' 19020720 info Activation N5Vmomi10ActivationE:0xb415d30 : Invoke done removeDatastore on vim.host.DatastoreSystem:ha-datastoresystem

2008-02-11 04:52:15.130 'Vmomi' 19020720 info Throw vim.fault.PlatformConfigFault

2008-02-11 04:52:15.130 'Vmomi' 19020720 info Result:

(vim.fault.PlatformConfigFault) {

dynamicType = <unset>,

text = "DestroyVmfsDatastore: can't delete partition 3 on lun vmhba1:0:0",

msg = ""

}

2008-02-11 04:54:08.435 'TaskManager' 14453680 info Task Created : haTask-ha-host-vim.HostSystem.enterMaintenanceMode-28

2008-02-11 04:54:08.436 'ha-host' 14453680 info EnterMaintenanceMode, timeout = 0

2008-02-11 04:54:08.436 'ha-host' 14453680 info ModeMgr::Enter: next = maintenance, current = normal, count = 0, timeout = 0

2008-02-11 04:54:08.436 'ha-host' 14453680 info ModeTaskFinished

2008-02-11 04:54:08.436 'TaskManager' 14453680 info Task Completed : haTask-ha-host-vim.HostSystem.enterMaintenanceMode-28

2008-02-11 04:54:17.511 'TaskManager' 14453680 info Task Created : haTask-ha-host-vim.host.DatastoreSystem.removeDatastore-31

2008-02-11 04:54:17.513 'Partitionsvc' 14453680 info InvokePartedUtil /sbin/partedUtil

2008-02-11 04:54:17.639 'Partitionsvc' 14453680 warning Unable to get partition information for /vmfs/devices/disks/vmhba1:0:0:0

2008-02-11 04:54:17.639 'Partitionsvc' 14453680 warning Status : 255

Output:

Error : Error: The partition table on /vmfs/devices/disks/vmhba1:0:0:0 is inconsistent. There are many reasons why this might be the case. However, the most likely reason is that Linux detected the BIOS geometry for /vmfs/devices/disks/vmhba1:0:0:0 incorrectly. GNU Parted suspects the real geometry should be 286838/64/32 (not 36566/255/63). You should check with your BIOS first, as this may not be correct. You can inform Linux by adding the parameter /devices/disks/vmhba1:0:0:0=286838,64,32 to the command line. See the LILO or GRUB documentation for more information. If you think Parted's suggested geometry is correct, you may select Ignore to continue (and fix Linux later). Otherwise, select Cancel (and fix Linux and/or the BIOS now).

Error: Can't have a partition outside the disk!

Unable to read partition table for device /vmfs/devices/disks/vmhba1:0:0:0

2008-02-11 04:54:17.639 'HostsvcPlugin' 14453680 info DeletePartition: Retrieving disk partition info failed : vim.fault.PlatformConfigFault

2008-02-11 04:54:17.640 'Hostsvc::DatastoreSystem' 14453680 error DestroyVmfsDatastore: can't delete partition 3 on lun vmhba1:0:0

2008-02-11 04:54:17.640 'Hostsvc::DatastoreSystem' 14453680 warning RemoveDatastore: Failed to remove backend for datastore vmware2:storage1.

2008-02-11 04:54:17.640 'TaskManager' 14453680 info Task Completed : haTask-ha-host-vim.host.DatastoreSystem.removeDatastore-31

2008-02-11 04:54:17.641 'Vmomi' 14453680 info Activation N5Vmomi10ActivationE:0xb412d58 : Invoke done removeDatastore on vim.host.DatastoreSystem:ha-datastoresystem

2008-02-11 04:54:17.642 'Vmomi' 14453680 info Throw vim.fault.PlatformConfigFault

2008-02-11 04:54:17.642 'Vmomi' 14453680 info Result:

(vim.fault.PlatformConfigFault) {

dynamicType = <unset>,

text = "DestroyVmfsDatastore: can't delete partition 3 on lun vmhba1:0:0",

msg = ""

}

2008-02-11 04:54:26.600 'TaskManager' 112589744 info Task Created : haTask-ha-host-vim.HostSystem.exitMaintenanceMode-32

2008-02-11 04:54:26.600 'ha-host' 112589744 info ExitMaintenanceMode, timeout = 0

2008-02-11 04:54:26.600 'ha-host' 112589744 info ModeMgr::Enter: next = normal, current = maintenance, count = 0, timeout = 0

2008-02-11 04:54:26.600 'ha-host' 112589744 info ModeTaskFinished

LoTorre · ‎02-13-2008

Having the same problem just now.

Installed the newest 3.5 ESX version on a brand new IBM 3650, three HDs in Raid5 configuration.

While waiting for my SAN storage to be installed, I decided to create a couple of VMs in the existing Raid5 array.

I've been working on it every day since last week, with two flawless Win2003 installations. But today all hell is breaking loose.

Basically, for the past hour I can no longer touch my ESX server. It boots, the VMs are configured for autoboot and they start working, but if I try to even telnet to them or open a URL on them, they just drop like iron.

Funny (not) thing is that I can still access the ESX server via ssh, but the system won't recognize 95% of the bash commands.

Also, I can't access any of the SCSI partitions containing the virtual disks. The exact moment I try that, even the ESX server itself stops answering.

The error shown is:

- CPU0:1024 VMNIX:0SCSi device set offline -command error recovery failed: host1 channel0 id0 lun0

- journal commit i/o error

When I ask the datacenter guy to reboot the server, he does it and everything starts working again.

But lo and behold, I try accessing (terminal) one of the VMs....bam! All down.

I'm all in for a hardware check, and I've already issued a ticket to IBM to go and check, but seeing this error replicated makes me doubtful that all this hardware is failing.

Anyway, I'm waiting on feedback from the IBM support tech; I'll post here when I have news.

Lorenzo

aboland · ‎02-13-2008

Yes, It's the same problem!! The same error.

The RAID controller model on the IBM x3650 is: ServeRAID 8k

and VMware ESX Server 3.5 supports it with this driver: aacraid_esx303

Do you see the warning about the partition geometry (C/H/S) in the logs?

aboland · ‎02-13-2008

Yes, it's the same error!!

More info:

My RAID controller model on the IBM x3650 is: ServeRAID 8k

It is supported by VMware ESX Server 3.5 (in theory).

LoTorre · ‎02-14-2008

Confirmed, same Raid controller (ServeRAID 8K) on the x3650 server.

IBM called back; they are going to replace the RAID controller with a new one, but it's not guaranteed that the controller is where the problem lies.

I'll let you know how it goes afterwards.

PS: both server and controller are of course supported by VMware. IBM itself sold us the whole hw+sw solution, so I really hope (for them) they've not messed up the compatibility.

aboland · ‎02-14-2008

Any news?

Andes_Support · ‎02-15-2008

We've got the same error, but on a DELL PowerEdge 2950... containing VMs in its local VMFS (local hard disk)

Any idea about this?

It seems it's an ESX issue... rather than hardware-related.
