frantzcollini
Contributor

Fatal crash of an ESX 3.5 Server

Hello,

I have completely lost my trust in VMware!

This morning, nothing worked. I connected a monitor to the physical ESX 3.5 machine and saw a screen of death, like the Windows BSOD.

The relevant messages were: LOST HEARTBEAT and Alert HB 365

Before the crash, the vmkernel log said:

Aug 26 22:01:34 vm1 vmkernel: 77:13:19:36.094 cpu0:1134)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:02:05 vm1 vmkernel: 77:13:20:06.255 cpu3:1134)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:02:35 vm1 vmkernel: 77:13:20:36.407 cpu6:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:03 vm1 vmkernel: 77:13:21:05.086 cpu4:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:03 vm1 vmkernel: 77:13:21:05.117 cpu4:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:03 vm1 vmkernel: 77:13:21:05.149 cpu4:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:03 vm1 vmkernel: 77:13:21:05.180 cpu4:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:03 vm1 vmkernel: 77:13:21:05.211 cpu7:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:04 vm1 vmkernel: 77:13:21:05.243 cpu7:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:04 vm1 vmkernel: 77:13:21:05.273 cpu7:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:04 vm1 vmkernel: 77:13:21:05.305 cpu5:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:04 vm1 vmkernel: 77:13:21:05.336 cpu4:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:04 vm1 vmkernel: 77:13:21:05.367 cpu4:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 26 22:03:04 vm1 vmkernel: 77:13:21:05.398 cpu6:1135)WARNING: VSCSI: 5294: WRITE10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

...and so on, repeated many more times.

The physical ESX machine currently has 94% of the space used in the /vmfs partition, and it runs about 14 virtual machines... maybe an overload?
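The arithmetic inside the warning itself shows what is going wrong: the guest is addressing blocks past the end of the virtual disk. A quick sketch using the exact numbers from the log lines above:

```shell
# Values taken from the WRITE10 warning above. A disk with 41943040
# blocks has valid offsets 0 .. 41943039, so a 128-block request
# starting at offset 41942975 runs 63 blocks past the end.
NUMBLOCKS=41943040; OFFSET=41942975; LEN=128
LAST=$((OFFSET + LEN - 1))
echo "last block touched: $LAST, last valid block: $((NUMBLOCKS - 1))"
# prints: last block touched: 41943102, last valid block: 41943039
```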

Thank you, cheers!

Francesco

8 Replies
ShanVMLand
Expert

Looks like it's time to clean up your VMFS partition. Try to identify which VM is causing the error. WRITE10 is the SCSI write command with a 10-byte CDB; one or more of your VMs is trying to write beyond its disk range.

Ref this thread: http://communities.vmware.com/thread/81961






If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!!

bulletprooffool
Champion

If you can afford to shut down your VMs and start them one at a time, or possibly just shut them down one at a time and monitor the logs as you do so, you should be able to identify the offending VMs.

Alternatively, if you have alternate storage available, try Storage VMotioning them one at a time and monitoring the logs to find the culprit?

One day I will virtualise myself . . .
frantzcollini
Contributor

I have looked into vmkernel.log from just a few minutes before the crash, but it shows a machine ID which is not listed among the existing ones in vm-support... and I do not know where this error comes from!

Francesco

rolohm
Enthusiast

Please don't lose your faith in VMware yet.

Do you have one or more VMs running with active snapshots? If so, that will eventually fill your vmfs partition if it wasn't set up with enough free space to start with. When the partition becomes full, all VMs with snapshots will stop running. (All snapshot files, even for vmdk files on other volumes, end up in the VM's home directory.)
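To see how much space snapshots are consuming, the growing redo logs can be listed directly from the service console. A rough sketch, assuming a datastore mounted under /vmfs/volumes/datastore1 (the path is an example; adjust it to your environment):

```shell
# Snapshot redo logs are the *-delta.vmdk files; they grow with every
# write the guest makes while a snapshot is active.
# /vmfs/volumes/datastore1 is only an example path.
DATASTORE=${DATASTORE:-/vmfs/volumes/datastore1}
if [ -d "$DATASTORE" ]; then
    find "$DATASTORE" -name '*-delta.vmdk' -exec ls -lh {} \;
fi
```

A VM with a large or fast-growing delta file is a prime suspect for filling the volume.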

/R

frantzcollini
Contributor

Hello, and thank you for your contribution!

I had an active snapshot, but the vmfs partition yesterday evening at 22:35 (the ESX crash time) still had about 5% of free space left... could this be the cause of the crash?

But why doesn't a product like VMware have some agent that prevents a system crash like a BSOD?

rolohm
Enthusiast

Whether 5% is a lot of course depends on how big your volume is, but it looks as if it wasn't enough this time. If you have a VM that transfers big files, dumps databases, or does similar work, the snapshot file grows at the speed of those changes, and you can run out of space much quicker than you think. I have a rule to keep 20% free space on home directory volumes... a number I learned back in the day and have stuck to. (I normally set up 200-250 GB home disks, which leaves me 40-50 GB of snapshot space.)

/R

frantzcollini
Contributor

No luck... the same crash happened again this late afternoon: all VMs disconnected, and the physical ESX machine was not pingable.

Logs said:

# cat vmkernel.1 | grep WARNING

Aug 27 12:47:44 vm1 vmkernel: 0:03:56:29.915 cpu7:1140)WARNING: VSCSI: 5294: READ10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 27 18:19:25 vm1 vmkernel: 0:09:28:10.686 cpu5:1139)WARNING: VSCSI: 5294: READ10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 28 11:15:40 vm1 vmkernel: 1:02:24:26.155 cpu0:1140)WARNING: VSCSI: 5294: READ10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 28 15:59:28 vm1 vmkernel: 1:07:08:13.778 cpu0:1139)WARNING: VSCSI: 5294: READ10 past end of virtual device with 41943040 numBlocks, offset 41942975, length 128

Aug 28 18:41:50 vm1 vmkernel: 0:00:00:04.219 cpu2:1039)WARNING: SCSI: 279: SCSI device type 0xd is not supported. Cannot create target vmhba0:288:0

Aug 28 18:41:50 vm1 vmkernel: 0:00:00:04.219 cpu2:1039)WARNING: SCSI: 1294: LegacyMP Plugin could not claim path: vmhba0:288:0. Not supported

Aug 28 18:41:50 vm1 vmkernel: 0:00:00:04.219 cpu2:1039)WARNING: ScsiPath: 3187: Plugin 'legacyMP' had an error (Not supported) while claiming path 'vmhba0:C0:T288:L0'.Skipping the path.

I am really worried!

Thank you!

Francesco

rolohm
Enthusiast

Hi again. It's weird that everything stops due to this error.

1. Is your /var/log perhaps not a separate partition, so that your root partition fills up due to big log files in /var/log? A full root file system is very bad! Check into this.

2. Have you verified that all your hardware is supported according to the hardware compatibility lists... and that the latest firmware is installed?

3. There seems to be a problem with at least a few of your VMs having a mismatch between the internal disk size and the .vmdk disk size. I googled my way to a homepage (http://blog.zhaospace.com) saying the following. Hope this can be of help.

ESX WRITE10 error

April 29th, 2009

Recently a WRITE10 error on one of my ESX hosts caught my attention; it occurs more than 10 times every second.

Apr 29 12:01:10 cla1011 vmkernel: 11:22:45:34.946 cpu4:1077)WARNING: VSCSI: 5291: WRITE10 past end of virtual device with 29365, length 128

After searching Google and the VMware communities and still not finding detailed information or a solution, I turned to VMware technical support. The support engineer sent me their internal KB.

-


Symptoms

Repeatedly logging messages similar to either of the following in /var/log/vmkernel (or /var/log/messages on ESXi):

Feb 5 15:44:46 USPLVS02 vmkernel: 63:05:31:58.181 cpu3:1129)WARNING: VSCSI: 5292: WRITE10 past end of virtual device with 33554432 numBlocks, offset 33554351, length 128

Feb 12 17:03:04 pa-tse-h02 vmkernel: 156:05:50:47.889 cpu0:1174)WARNING: VSCSI: 3430: READ10 past end of virtual device with 20971520 numBlocks, offset 20980737, length 16

These messages indicate that I/O is being attempted that is outside the boundaries of the virtual device (virtual disk). In layman's terms, the VM has a list of ten items, and the guest OS is asking for the 12th item on the list.

Resolution


To find out which VM is responsible for these, the World ID (WID) must be determined from the log messages. The WID is after the cpu specifier, and before the WARNING in the above messages. In the case of the WRITE10 message above, the WID is 1129; for the READ10 message the WID is 1174.
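The WID extraction described above can be scripted; a minimal sketch using one of the sample lines from this KB (in practice you would feed grep output from /var/log/vmkernel through the same sed expression):

```shell
# The sed pattern grabs the digits between "cpuN:" and the closing
# parenthesis, i.e. the World ID of the offending world.
LINE='Feb 5 15:44:46 USPLVS02 vmkernel: 63:05:31:58.181 cpu3:1129)WARNING: VSCSI: 5292: WRITE10 past end of virtual device with 33554432 numBlocks, offset 33554351, length 128'
WID=$(echo "$LINE" | sed 's/.*cpu[0-9]*:\([0-9]*\)).*/\1/')
echo "$WID"    # prints 1129
```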

If you look in /proc/vmware/sched/cpu, the vcpu column (the first one) will list the number identified in the logs.

To determine the VM responsible if it is not running:

cat `ls -rt vmkern*` | less

Find the first instance of the log message (searching with "/WRITE10" or "/READ10" in less will likely find it for you). Then search backwards in the logs for the WID value (in less this can be done with "?", e.g. ?1129; note that it searches beginning just before the top line on screen, and pressing 'n' finds the next match) and keep searching earlier in the logs until you find something similar to:

Feb 12 16:51:55 pa-tse-h02 vmkernel: 156:05:39:38.873 cpu2:1173)Sched: vm 1174: 4836: adding 'vmm0:ProblemVMName': group 'host/user/pool0': cpu: shares=2911 min=0 max=-1

The text will show you the name of the problematic VM after the vmm entry. In this case, "adding 'vmm0:ProblemVMName'" shows that the VM causing the issue is named ProblemVMName.
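Once the WID is known, the backwards search can also be done with grep and the VM name stripped out with sed. A sketch on the sample Sched line above (the WID and VM name are from the KB example; on a real host you would grep the rotated vmkernel logs instead of a shell variable):

```shell
WID=1174
# On a real host: grep "Sched: vm $WID:" /var/log/vmkern*
LOGLINE="Feb 12 16:51:55 pa-tse-h02 vmkernel: 156:05:39:38.873 cpu2:1173)Sched: vm 1174: 4836: adding 'vmm0:ProblemVMName': group 'host/user/pool0': cpu: shares=2911 min=0 max=-1"
# Keep only lines registering that world, then cut out the name after vmm0:
echo "$LOGLINE" | grep "Sched: vm $WID:" | sed "s/.*vmm0:\([^']*\)'.*/\1/"
# prints: ProblemVMName
```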

If you look at the contents of the descriptor file for the offending Virtual Machine's disks, you will find an entry listing the number of cylinders for the virtual disk. As an example:

ddb.geometry.cylinders = "2088"

In this case, the virtual disk has 2088 cylinders. Running "fdisk -l" against the flat file of the virtual disk will return information similar to:

You must set cylinders.

You can do this from the extra functions menu.

Disk ANSGOOD-flat.vmdk: 0 MB, 0 bytes

255 heads, 63 sectors/track, 0 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System

ANSGOOD-flat.vmdk1 * 1 2089 16779861 7 HPFS/NTFS

Partition 1 has different physical/logical endings:

phys=(1023, 254, 63) logical=(2088, 254, 63)

Note that in this case, the end value for the partition set in the partition table is 2089, exceeding the number of cylinders set in the descriptor file. If this were proper, it would show 2088 as the end value instead of 2089. The operating system, as a result of this incorrect partition table, attempts I/O beyond the end of the virtual disk, which produces the warnings above.
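The mismatch can be checked with a bit of shell arithmetic: a CHS disk holds cylinders x heads x sectors addressable sectors, and the partition's end cylinder must stay within that. Using the 2088/2089 values from the example above:

```shell
# Geometry from the descriptor file vs. the end cylinder fdisk reported.
CYLINDERS=2088; HEADS=255; SECTORS=63
DISK_SECTORS=$((CYLINDERS * HEADS * SECTORS))    # what the VMDK provides
PART_END_SECTORS=$((2089 * HEADS * SECTORS))     # what the partition claims
echo "disk=$DISK_SECTORS partition_end=$PART_END_SECTORS"
# prints: disk=33543720 partition_end=33559785 -> the partition overruns the disk
```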

Extending the VMDK just enough to allow it to contain the size of the partition table might fix this, but because it is an invalid combination of settings, it cannot be safely assumed that this will work. Another possible fix is to make the partition table fit within the disk by correcting its ending value. Moving the data to a new, properly configured disk/partition table/file system is the best bet, since the state of the file system after modifying the VMDK or partition table isn't known; it may be damaged by the changes, or may already be damaged. Give the customer the options and let them choose how to handle the changes to their system, as they can best judge how they want to protect their data.

-


The solution is either to extend the VMDK file or to shrink the partition table; extending seems safer than shrinking. The solution I chose was using VMware Converter. By the way, VMware Converter 4 offers some cool features over the previous version 3.0.3.
